Springer Protocols 


Computational Biology and Machine Learning for
Metabolic Engineering and Synthetic Biology


Humana Press


METHODS IN MOLECULAR BIOLOGY 


Series Editor 
John M. Walker 
School of Life and Medical Sciences 
University of Hertfordshire 
Hatfield, Hertfordshire, UK 


For further volumes: 
http://www.springer.com/series/7651 


For over 35 years, biological scientists have come to rely on the research protocols and 
methodologies in the critically acclaimed Methods in Molecular Biology series. The series was
the first to introduce the step-by-step protocols approach that has become the standard in all
biomedical protocol publishing. Each protocol is provided in readily-reproducible step-by-step
fashion, opening with an introductory overview, a list of the materials and reagents
needed to complete the experiment, and followed by a detailed procedure that is supported 
with a helpful notes section offering tips and tricks of the trade as well as troubleshooting 
advice. These hallmark features were introduced by series editor Dr. John Walker and 
constitute the key ingredient in each and every volume of the Methods in Molecular Biology 
series. Tested and trusted, comprehensive and reliable, all protocols from the series are 
indexed in PubMed. 


Computational Biology 
and Machine Learning 
for Metabolic Engineering 
and Synthetic Biology 


Edited by 
Kumar Selvarajoo 


Bioinformatics Institute, Agency for Science, Technology & Research, Singapore, Singapore 


Humana Press


Editor 

Kumar Selvarajoo 
Bioinformatics Institute, 
Agency for Science, 
Technology & Research 
Singapore, Singapore 


ISSN 1064-3745 ISSN 1940-6029 (electronic) 
Methods in Molecular Biology 
ISBN 978-1-0716-2616-0 ISBN 978-1-0716-2617-7 (eBook) 


https://doi.org/10.1007/978-1-0716-2617-7


© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Science+Business Media, LLC, part
of Springer Nature 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part 
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, 
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and 
retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter 
developed. 

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, 
even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations 
and therefore free for general use. 

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to 
be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, 
expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been 
made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. 


This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature.
The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A. 


Preface 


Living systems display highly complex dynamic behaviors and can demonstrate self-organizing
emergent properties for better survival. Thus, traditional highly reductionist or wet bench
approaches alone are insufficient. Systems Biology has emerged as a new field integrating
biological experimentation with theoretical concepts adopted from physics, mathematics,
statistics, and computer science. More recently, computational modeling, artificial
intelligence, and data analytics have been gaining acceptance for predictive biological
research.

For metabolic engineering and synthetic biology, the use of cross-disciplinary techniques
to analyze complex and high-throughput biological data is now widely adopted for designing
and producing microbes with optimized yields of valuable products. Computational and machine
learning approaches are also gaining traction for supporting disease investigation and
interventions.

In this book, I have carefully drawn together individual chapters that provide protocols for
computational, statistical, and machine learning methods, applied largely to metabolic
engineering and synthetic biology, with two chapters on disease applications. The authors
are well-established scientists in their respective fields. These approaches will support the
current progress in cross-disciplinary research that is widely discussed and explored as the
next step for integrating the different scales of biological complexity.

Geared toward researchers with limited molecular engineering and computational
analytical or modeling experience, the book provides a broad overview of the subject and
detailed instructions in computational and machine learning approaches. The text is written
in a simple technical manner as an introduction for physicists, chemists, computer scientists,
and biologists who are interested in understanding how basic to advanced computational
biology and machine learning methods are adopted for metabolic engineering, synthetic
biology, and disease modeling research.

Chapter 1 by Daboussi and Lindley begins with a general overview of the field of
metabolic engineering and the challenges facing industrial biotechnology applications,
especially for high-value products. In the following chapter, Chang and colleagues provide
a brief historical background to synthetic biology and go on to succinctly summarize
recent machine learning activities as applied to diverse synthetic biology applications.

In Chap. 3, Gilliot and Gorochowski report a computational model that can analyze and
predict massively parallel reporter assay (MPRA) experiments based on fluorescence-activated
cell sorting procedures. André and colleagues, in Chap. 4, provide computational
methods to investigate and build molecular assemblies of proteins, which can then be used
to predict structural models of the protein partners and, using coevolution information,
search for interacting regions. Smith introduces a new bioinformatics pipeline from genome
mining to structural protein engineering in Chap. 5.

Next, Logel and Jaschke, in Chap. 6, provide a new workflow using hidden Markov
models to create synthetic overlaps between two proteins, by protecting the engineered
coding sequences from mutation or loss of function. This is followed by a Boolean logic
model demonstration of convergent promoters for synthetic biology applications by Abraha
and Marchisio (Chap. 7). A fundamentally similar digital approach was taken by Guiziou and
colleagues for the design of recombinase logic circuits in Chap. 8.




Focusing on lab- to industrial-scale bioprocessing, Yeoh and Poh describe an integrated
modeling workflow that combines experimental datasets with a kinetic modeling approach and
computational fluid dynamics (Chap. 9). Following this, Collins and colleagues present a
kinetic model using spatial information to study the subcellular recruitment of optogenetic
proteins to the plasma membrane for synthetic biology applications (Chap. 10). For engineering
the microbe Vibrio natriegens to produce 1,3-propanediol, Chen and colleagues show genome-scale
metabolic models and genome editing protocols as crucial workflows in Chap. 11.

In Chap. 12, Helmy and Selvarajoo present a pipeline for rigorous transcriptomics data
analytics for synthetic biology applications, while Gendoo demonstrates a suite of
bioinformatics software and databases that are very useful for metabolic engineering (Chap. 13).
Sugimoto presents a three-dimensional mathematical model and protocol, in Chap. 14, that
incorporates the sprouting and branching events in angiogenesis and tumor growth in
cancers.

The final five chapters, Chapters 15, 16, 17, 18, and 19, describe machine learning
methods for diverse applications, ranging from metabolic pathway (Bonetta and colleagues,
Cuperlovic-Culf and colleagues), omics (Niranjan and colleagues), and disease (Occhipinti
and colleagues) analyses to the elucidation of protein interaction networks (Sundar and
colleagues).

I believe the chapters presented in this book will help all readers grasp the
general trend of modern computational methods applied to understand and predict complex
biology.


Kumar Selvarajoo
Bioinformatics Institute, Agency for Science, Technology & Research,
Singapore, Singapore


Contents 


Preface
Contributors

1 Challenges to Ensure a Better Translation of Metabolic Engineering
for Industrial Applications
Fayza Daboussi and Nic D. Lindley

2 Synthetic Biology Meets Machine Learning
Brendan Fu-Long Sieow, Ryan De Sotto, Zhi Ren Darren Seet,
In Young Hwang, and Matthew Wook Chang

3 Design and Analysis of Massively Parallel Reporter Assays
Using FACS-Based Readouts
Pierre-Aurelien Gilliot and Thomas E. Gorochowski

4 Modeling Protein Complexes and Molecular Assemblies
Using Computational Methods
Romain Launay, Elin Teppa, Jérémy Esque, and Isabelle André

5 From Genome Mining to Protein Engineering: A Structural
Bioinformatics Route
Derek J. Smith

6 Creating De Novo Overlapped Genes
Dominic Y. Logel and Paul R. Jaschke

7 Design of Gene Boolean Gates and Circuits with Convergent Promoters
Biruck Woldai Abraha and Mario Andrea Marchisio

8 Computational Methods for the Design of Recombinase Logic Circuits
with Adaptable Circuit Specifications
Ana Zuniga, Jérôme Bonnet, and Sarah Guiziou

9 Designing a Model-Driven Approach Towards Rational Experimental
Design in Bioprocess Optimization
Jing Wui Yeoh and Chueh Loo Poh

10 Modeling Subcellular Protein Recruitment Dynamics
for Synthetic Biology
Kwabena A. Badu-Nkansah, Diana Sernas, Dean E. Natwick,
and Sean R. Collins

11 Genome-Scale Modeling and Systems Metabolic Engineering
of Vibrio natriegens for the Production of 1,3-Propanediol
Ye Zhang, Dehua Liu, and Zhen Chen

12 Application of GeneCloudOmics: Transcriptomic Data Analytics
for Synthetic Biology
Mohamed Helmy and Kumar Selvarajoo

13 Overview of Bioinformatics Software and Databases
for Metabolic Engineering
Deena M. A. Gendoo

14 Computational Simulation of Tumor-Induced Angiogenesis
Masahiro Sugimoto

15 Computational Methods and Deep Learning for Elucidating Protein
Interaction Networks
Dhvani Sandip Vora, Yogesh Kalakoti, and Durai Sundar

16 Machine Learning Methods for Survival Analysis with Clinical
and Transcriptomics Data of Breast Cancer
Le Minh Thao Doan, Claudio Angione, and Annalisa Occhipinti

17 Machine Learning Using Neural Networks for Metabolomic
Pathway Analyses
Rosalin Bonetta Valentino, Jean-Paul Ebejer,
and Gianluca Valentino

18 Machine Learning and Hybrid Methods for Metabolic
Pathway Modeling
Miroslava Cuperlovic-Culf, Thao Nguyen-Tran,
and Steffany A. L. Bennett

19 A Machine Learning-Based Approach Using Multi-omics Data
to Predict Metabolic Pathways
Vidya Niranjan, Akshay Uttarkar, Aakaanksha Kaul,
and Maryanne Varghese


Contributors 


Biruck Woldai Abraha • School of Pharmaceutical Science and Technology, Tianjin
University, Tianjin, China

Isabelle André • Toulouse Biotechnology Institute, TBI, Université de Toulouse, CNRS,
INRAE, INSA, Toulouse Cedex 04, France

Claudio Angione • School of Computing, Engineering and Digital Technologies, Teesside
University, Middlesbrough, UK; Centre for Digital Innovation, Teesside University,
Middlesbrough, UK; Healthcare Innovation Centre, Teesside University, Middlesbrough,
UK; National Horizons Centre, Teesside University, Darlington, UK

Kwabena A. Badu-Nkansah • Department of Microbiology and Molecular Genetics,
University of California, Davis, Davis, CA, USA

Steffany A. L. Bennett • Department of Biochemistry, Microbiology, and Immunology,
University of Ottawa, Ottawa, ON, Canada; Neural Regeneration Laboratory, Ottawa
Institute of Systems Biology, Brain and Mind Research Institute, University of Ottawa,
Ottawa, ON, Canada; Department of Chemistry and Biomolecular Sciences, Centre for
Catalysis Research and Innovation, University of Ottawa, Ottawa, ON, Canada

Rosalin Bonetta Valentino • Barts and the London School of Medicine and Dentistry,
Queen Mary University of London, Victoria, Malta

Jérôme Bonnet • Centre de Biologie Structurale (CBS), Univ. Montpellier, INSERM
U1054, CNRS UMR5048, Montpellier, France

Matthew Wook Chang • NUS Synthetic Biology for Clinical and Technological Innovation
(SynCTI), National University of Singapore, Singapore, Singapore; Synthetic Biology
Translational Research Programme, Yong Loo Lin School of Medicine, National University
of Singapore, Singapore, Singapore; Department of Biochemistry, Yong Loo Lin School of
Medicine, National University of Singapore, Singapore, Singapore

Zhen Chen • Key Laboratory for Industrial Biocatalysis (Ministry of Education),
Department of Chemical Engineering, Tsinghua University, Beijing, China; Tsinghua
Innovation Center in Dongguan, Dongguan, China; Center for Synthetic and Systems
Biology, Tsinghua University, Beijing, China

Sean R. Collins • Department of Microbiology and Molecular Genetics, University of
California, Davis, Davis, CA, USA

Miroslava Cuperlovic-Culf • Digital Technologies Research Centre, National Research
Council of Canada, Ottawa, ON, Canada; Department of Biochemistry, Microbiology,
and Immunology, University of Ottawa, Ottawa, ON, Canada

Fayza Daboussi • Toulouse White Biotechnology, Toulouse cedex 4, France; Toulouse
Biotechnology Institute, Toulouse cedex 4, France

Ryan De Sotto • NUS Synthetic Biology for Clinical and Technological Innovation
(SynCTI), National University of Singapore, Singapore, Singapore; Synthetic Biology
Translational Research Programme, Yong Loo Lin School of Medicine, National University
of Singapore, Singapore, Singapore; Department of Biochemistry, Yong Loo Lin School of
Medicine, National University of Singapore, Singapore, Singapore




Le Minh Thao Doan • School of Computing, Engineering and Digital Technologies, Teesside
University, Middlesbrough, UK

Jean-Paul Ebejer • Centre for Molecular Medicine and Biobanking, University of Malta,
Msida, Malta

Jérémy Esque • Toulouse Biotechnology Institute, TBI, Université de Toulouse, CNRS,
INRAE, INSA, Toulouse Cedex 04, France

Deena M. A. Gendoo • Centre for Computational Biology, Institute of Cancer and
Genomic Sciences, University of Birmingham, Birmingham, United Kingdom; Institute of
Cancer and Genomic Sciences, University of Birmingham, Birmingham, United Kingdom

Pierre-Aurelien Gilliot • School of Biological Sciences, University of Bristol, Bristol, UK

Thomas E. Gorochowski • School of Biological Sciences, University of Bristol, Bristol, UK

Sarah Guiziou • Department of Biology, University of Washington, Seattle, WA, USA

Mohamed Helmy • Bioinformatics Institute, Agency for Science, Technology and Research
(A*STAR), Singapore, Singapore; Department of Computer Science, Lakehead University,
Thunder Bay, ON, Canada

In Young Hwang • NUS Synthetic Biology for Clinical and Technological Innovation
(SynCTI), National University of Singapore, Singapore, Singapore; Synthetic Biology
Translational Research Programme, Yong Loo Lin School of Medicine, National University
of Singapore, Singapore, Singapore; Department of Biochemistry, Yong Loo Lin School of
Medicine, National University of Singapore, Singapore, Singapore

Paul R. Jaschke • School of Natural Sciences, Macquarie University, Sydney, NSW,
Australia; ARC Centre of Excellence in Synthetic Biology, Macquarie University, Sydney,
NSW, Australia

Yogesh Kalakoti • Department of Biochemical Engineering and Biotechnology, Indian
Institute of Technology Delhi, Hauz Khas, New Delhi, India

Aakaanksha Kaul • Department of Biotechnology, R V College of Engineering, Mysuru
Road, Kengeri, Bengaluru, India

Romain Launay • Toulouse Biotechnology Institute, TBI, Université de Toulouse, CNRS,
INRAE, INSA, Toulouse Cedex 04, France

Nic D. Lindley • Toulouse White Biotechnology, Toulouse cedex 4, France; Toulouse
Biotechnology Institute, Toulouse cedex 4, France; A*STAR Singapore Institute of Food and
Biotechnology Innovation (SIFBI), Singapore, Singapore

Dehua Liu • Key Laboratory for Industrial Biocatalysis (Ministry of Education),
Department of Chemical Engineering, Tsinghua University, Beijing, China; Tsinghua
Innovation Center in Dongguan, Dongguan, China; Center for Synthetic and Systems
Biology, Tsinghua University, Beijing, China

Dominic Y. Logel • School of Natural Sciences, Macquarie University, Sydney, NSW,
Australia; ARC Centre of Excellence in Synthetic Biology, Macquarie University, Sydney,
NSW, Australia

Mario Andrea Marchisio • School of Pharmaceutical Science and Technology, Tianjin
University, Tianjin, China

Dean E. Natwick • Department of Microbiology and Molecular Genetics, University of
California, Davis, Davis, CA, USA




Thao Nguyen-Tran • Department of Biochemistry, Microbiology, and Immunology,
University of Ottawa, Ottawa, ON, Canada; Neural Regeneration Laboratory, Ottawa
Institute of Systems Biology, Brain and Mind Research Institute, University of Ottawa,
Ottawa, ON, Canada; Department of Chemistry and Biomolecular Sciences, Centre for
Catalysis Research and Innovation, University of Ottawa, Ottawa, ON, Canada

Vidya Niranjan • Department of Biotechnology, R V College of Engineering, Mysuru Road,
Kengeri, Bengaluru, India

Annalisa Occhipinti • School of Computing, Engineering and Digital Technologies, Teesside
University, Middlesbrough, UK; Centre for Digital Innovation, Teesside University,
Middlesbrough, UK; National Horizons Centre, Teesside University, Darlington, UK

Chueh Loo Poh • Department of Biomedical Engineering, College of Design and
Engineering, National University of Singapore, Singapore, Singapore; NUS Synthetic
Biology for Clinical and Technological Innovation (SynCTI), National University of
Singapore, Singapore, Singapore

Zhi Ren Darren Seet • NUS Synthetic Biology for Clinical and Technological Innovation
(SynCTI), National University of Singapore, Singapore, Singapore; Synthetic Biology
Translational Research Programme, Yong Loo Lin School of Medicine, National University
of Singapore, Singapore, Singapore; Department of Biochemistry, Yong Loo Lin School of
Medicine, National University of Singapore, Singapore, Singapore

Kumar Selvarajoo • Bioinformatics Institute, Agency for Science, Technology and Research
(A*STAR), Singapore, Singapore; Singapore Institute of Food and Biotechnology
Innovation, Agency for Science, Technology and Research (A*STAR), Singapore,
Singapore; NUS Synthetic Biology for Clinical and Technological Innovation (SynCTI),
National University of Singapore, Singapore, Singapore; School of Biological Sciences,
Nanyang Technological University, Singapore, Singapore

Diana Sernas • Department of Microbiology and Molecular Genetics, University of
California, Davis, Davis, CA, USA

Brendan Fu-Long Sieow • NUS Synthetic Biology for Clinical and Technological
Innovation (SynCTI), National University of Singapore, Singapore, Singapore; Synthetic
Biology Translational Research Programme, Yong Loo Lin School of Medicine, National
University of Singapore, Singapore, Singapore; Department of Biochemistry, Yong Loo Lin
School of Medicine, National University of Singapore, Singapore, Singapore; NUS
Graduate School for Integrative Sciences and Engineering Programme, National
University of Singapore, Singapore, Singapore

Derek J. Smith • Singapore Institute for Food and Biotechnology Innovation (SIFBI),
Singapore, Singapore

Masahiro Sugimoto • Institute of Medical Science, Tokyo Medical University, Shinjuku,
Tokyo, Japan; Institute for Advanced Biosciences, Keio University, Tsuruoka, Yamagata,
Japan

Durai Sundar • Department of Biochemical Engineering and Biotechnology, Indian
Institute of Technology Delhi, Hauz Khas, New Delhi, India; School of Artificial
Intelligence, Indian Institute of Technology Delhi, Hauz Khas, New Delhi, India

Elin Teppa • Toulouse Biotechnology Institute, TBI, Université de Toulouse, CNRS, INRAE,
INSA, Toulouse Cedex 04, France

Akshay Uttarkar • Department of Biotechnology, RV College of Engineering, Mysuru Road,
Kengeri, Bengaluru, India

Gianluca Valentino • Department of Communications and Computer Engineering,
University of Malta, Msida, Malta




Maryanne Varghese • Department of Biotechnology, RV College of Engineering, Mysuru
Road, Kengeri, Bengaluru, India

Dhvani Sandip Vora • Department of Biochemical Engineering and Biotechnology, Indian
Institute of Technology Delhi, Hauz Khas, New Delhi, India

Jing Wui Yeoh • Department of Biomedical Engineering, College of Design and
Engineering, National University of Singapore, Singapore, Singapore; NUS Synthetic
Biology for Clinical and Technological Innovation (SynCTI), National University of
Singapore, Singapore, Singapore

Ye Zhang • Key Laboratory for Industrial Biocatalysis (Ministry of Education), Department
of Chemical Engineering, Tsinghua University, Beijing, China

Ana Zuniga • Centre de Biologie Structurale (CBS), Univ. Montpellier, INSERM U1054,
CNRS UMR5048, Montpellier, France




Challenges to Ensure a Better Translation of Metabolic 
Engineering for Industrial Applications 


Fayza Daboussi and Nic D. Lindley 


Abstract 


Metabolic engineering has evolved towards creating cell factories with increasingly complex pathways as
economic criteria push biotechnology to higher value products to provide a sustainable source of specialty
chemicals. Optimization of such pathways often requires highly combinatorial exploration of the best pathway
balance, and this has led to increasing use of high-throughput automated strain construction platforms or
novel optimization techniques. In addition, the low catalytic efficiency of such pathways has shifted
emphasis from gene expression strategies towards novel protein engineering to increase the specific activity of
the enzymes involved so as to limit the metabolic burden associated with excessively high pressure on
ribosomal machinery when using massive overexpression systems. Metabolic burden is now generally
recognized as a major hurdle to be overcome, with consequences for genetic stability but also for the
intensified performance needed industrially to attain the economic targets for successful product launch.
Increasing awareness of the need to integrate novel genetic information into specific sites within the
genome which not only enhance genetic stability (safe harbors) but also enable maximum expression
profiles has led to genome-wide assessment of the best integration sites, and bioinformatics will facilitate the
identification of the most probable landing pads within the genome.

To facilitate the transfer of novel biotechnological potential to industrial-scale production, more attention,
however, has to be paid to engineering metabolic fitness adapted to the specific stress conditions
inherent to large-scale fermentation and the inevitable heterogeneity that will occur due to mass transfer
limitations and the resulting deviation from the ideal conditions seen in laboratory-scale validation of
the engineered cells. To ensure smooth and rapid transfer of novel cell lines to industry with an accelerated
passage through scale-up, better coordination is required from the outset between the biochemical
engineers involved in process technology and the genetic engineers building the new strain so as to have
an overall strategy able to maximize innovation at all levels. This should be one of our key objectives when
building fermentation-friendly chassis organisms.


Key words Cell factory, Industrial fermentation, Genetic stability, Pathway optimization, Biotechnology,
Specialty chemicals


1. Introduction 


When contacted to prepare this chapter, our initial thought was
to prepare an overview of all the positive advances that have been
made in generating new engineered microbes. However, this is a
mammoth undertaking and would probably justify an entire book
rather than a simple introductory chapter. Indeed, many of the
other chapters will deal with just such updates and showcase some
of the successful achievements made to date. Instead, we have tried
to look at why much of the extraordinary potential of metabolic
engineering has not always been successfully translated to industry
and how the challenges have evolved as metabolic targets shift from
relatively simple molecules to more complex high-value metabolites,
which we believe are going to be increasingly important. Indeed, a
recent opinion paper [1] gave some examples of current success
stories arising from the synthetic biology approach to metabolic
engineering and some ideas as to what the future might hold. A
few years back, we might optimistically have stated that “the sky is
the limit,” but some of the ideas go beyond this stratospheric
ceiling and deal with concepts that could facilitate sustainable
space travel. Despite these success stories and a growing number
of products coming out of the pipeline, many potentially exciting
strains continue to struggle to bridge the gap in key titer, rate, and
yield (TRY) levels between feasibility studies in the research
laboratory and industry. We will try to identify where some of the
current bottlenecks lie in driving such studies to industrial
exploitation, as this needs to be better understood by those working
outside industry. Obviously this overview cannot be complete and
serves more to illustrate some areas in which the growing synthetic
biology community might want to invest in order to accelerate the
translation of promising research into realistic industrial
exploitation. As the world awaits sustainable solutions to meet the
requirements of consumers, an unprecedented opportunity awaits
us, and we have to ensure that the impressive accumulation of new
knowledge can be translated into biotechnological answers to this
demand.


Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology,
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_1,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


1.1 Historical Perspective


Synthetic biology is quite a vast domain covering a whole variety of
topics, but we have deliberately restricted this chapter to what used
to be covered by the term metabolic engineering, or the rational
engineering of the microbial genome to generate high-performance
microbes that fit the cell factory concept for the economically
interesting conversion of simple feedstocks into desirable metabolites,
an approach initiated more than 40 years ago [2, 3]. As many of
us active in this domain appreciate, the adjective “rational” is often
the limit to this statement and often a weak spot in metabolic
engineering, which has turned to alternative high-throughput logic to
overcome the incomplete understanding of how microbes function
[4, 5] and, more precisely, how such functionality will be modified
by adding new biochemical pathways [6] to the existing metabolic
network. Much of the early progress in metabolic engineering
covered the upgrading of natural pathways to enable efficient
accumulation of the desired product, using genetic tools to modify
pathway composition and regulation, but fundamentally it often
optimized pathways that already existed in the cells or
needed relatively modest additions to divert a pathway to an
alternative product [7-9]. This led to a significant number of
applications covering simple molecules such as amino acids,
vitamins, organic acids, and solvents as well as some of the key
biopolymers. Details of how this was achieved can be found in
any of the reviews that have appeared on a regular basis. One
of the key problems to be faced was conversion efficiency, because
chemical synthesis from fossil fuels was often easier to industrialize,
leading to a situation in which profitability was dependent on the
relative cost of petrochemical and sugar feedstocks and the
conversion factors involved. Looking into cheap feedstocks has also
helped drive this push and avoids competition with food
requirements for the classical fermentation substrates.

Recently, many of these metabolic engineering strategies have
hit economic difficulties despite some excellent scientific progress
being made. While this varies from one application domain to another,
the current situation does not favor the biotechnological production
of simple bulk chemicals except when the biological process
offers some key advantages such as avoiding toxic waste products.
The strong competition from chemical synthesis using fossil
resources has, however, generated very strong pressure to improve
yields and final concentrations, and we now have an increasing list
of molecules that can be produced with yields close to the theoretical
maxima and in concentrations often exceeding 100 g/L.
Despite this, we still suffer in process competitiveness, though this
could of course change in the coming years as climate change
criteria become increasingly important in a carbon-neutral vision of
the world’s industrial economy. To put this into perspective,
however, one has to take into account that the vast majority of
petrochemicals are used for energy production (about 75% of output)
in one form or another, and only a relatively small amount is
used for a highly diverse list of chemical synthons used to produce
several thousand products [10]. Replacing any single product
outside the energy domain is therefore unlikely to have a huge
effect on climate considerations.

The metabolic engineering community has switched its logic to 
react to this situation. Some moved towards the domain of systems 
biology aware that the underpinning knowledge base was often a 
limiting factor in the pragmatic engineering strategies that had 
attracted them initially to the metabolic engineering domain 
[6]. In parallel, we also saw a consolidated movement towards 
high-throughput automated platforms (biofoundries) coming 
into play [11] so that the rational knowledge-based optimization 
could effectively be replaced or complemented by a more empirical
investigation of the unknown phenomena that remained obscure 




Fayza Daboussi and Nic D. Lindley 


and hence difficult to engineer. We have seen increasingly sophisti- 
cated platforms which can upgrade our strain engineering capacity 
quite remarkably so that today we can explore multiple possibilities 
in a timeline which was previously not even a possibility in our 
wildest dreams. The challenge here is not only to capture the best 
constructs which match with our application-driven targets but also 
to capture the wealth of information that is hidden in the strains 
which do not perform as we would hope. This is typified by the 
impressive facilities available to some of the private synthetic 
biology companies exploring new product development such as 
Amyris, Zymergen, and Ginkgo Bioworks who have integrated
advanced bioinformatics with fully automated strain construction 
platforms with the capacity to rapidly optimize novel metabolic 
pathways, though this development is restricted to a relatively 
small number of workhorse chassis organisms. Coupled to some 
of the machine learning technology, this abundance of data will 
point us more rationally to some of the engineering which is today 
still rather empirical and help build a systems biology knowledge 
base to better predict most probable strain constructs 
[12, 13]. How we move in that direction will probably depend on
our capacity to integrate automatic data handling and metabolic 
modeling tools to extract the hidden meaning from our “failed” 
constructs. 

Accompanying these changes was a movement towards 
engineering strains able to produce more complex secondary meta- 
bolites with high added value. The appearance of the term synthetic 
biology acknowledged the fact that today we are not simply 
engineering known established pathways but actually looking at 
novel artificial pathways and entirely novel regulatory control 
mechanisms in what we might term synthetic biodiversity. This 
shift towards more complex biochemical pathways of course brings 
its own specific problems and some new paradigms to be resolved 
which will ultimately need those of us working in this domain to 
expand our toolbox to meet these challenges. Many of the 
references cited above cover some of these aspects. At a moment 
where the industrial world is awaiting this innovation, we need to 
be sure that what we are developing in the laboratory can be 
translated to the world of industrial fermentation in which the 
microbes are often subject to quite extreme growth conditions 
and resulting metabolic stress building on the specific stress of 
having their metabolic topology modified quite considerably. All 
too often our highly tuned and sensitive microbes fail to perform 
when placed in the industrial context, and this is harmful for our 
credibility as few domains are so closely tied to the application 
potential as metabolic engineering. 


Translating Metabolic Engineering to Industry 5 


We have the expertise, but maybe we need to be selective in 
how we use this toolbox to generate strains which are more robust 
and able to deal with the consequences of metabolic burden 
[14]. Are synthetic biology and the specific domain of metabolic
engineering failing to live up to their promise and remaining anchored
predominantly as an academic exercise to showcase some elegant and rather
sophisticated reprogramming of microbial metabolism? We would
like to give a categorical negative reply to this, but the truth is 
probably less clear-cut. A lot of promising strains still fail to perform 
as expected when translated to realistic industrial fermentation 
conditions. Let’s look at some of the aspects which are currently 
bottlenecks to achieving the type of performance that is needed to 
make our processes economically viable and scalable to large-scale 
industrial fermentation applications. 

First of all, however, let us dispel the idea that GMO technol- 
ogies are a problem: our arsenal of therapeutic molecules is today 
predominantly from genetic engineering which has taken over 
largely from chemical synthesis as the therapeutic targets become 
increasingly large molecules. Likewise, the biofuels and substitu- 
tion logic for compounds used by the speciality chemicals industry, 
currently derived predominantly from non-sustainable petrochem- 
icals, has been driven largely by genetically engineered high- 
performance strains. Perhaps the food industry is still recalcitrant 
to GMO foods, but in many cases, we will be producing ingredients 
which suffer no such labeling restrictions as long as such products 
are DNA-free and demonstrated to be safe. One might point out
that almost all the enzymes we employ in the food industry [15]
are produced using high-performance chassis organisms which have
been engineered to attain enzyme concentrations that quite regularly
surpass 50 g/L and can exceed 100 g/L. Beyond this technical enzyme
market, we can also see major innovation to boost non-animal dairy
products and substitutes for other animal-derived protein sources,
driven by animal welfare and sustainability concerns. Of course,
therapeutic proteins are exclusively produced using engineered chassis
organisms, whether they be microbial, plant, or animal cell lines [16].
However,
it is probably in the domain of speciality ingredients that market 
opportunities are highest with a shift of consumers to biosourced 
supply chains. The cellular content of such metabolites in plants 
tends to be very low, and it is debatable today if land use should be 
dedicated to such crops when engineered microbes can produce 
concentrations several orders of magnitude higher than found in 
plants and reproduce exactly the same stereochemistry (often a 
major influence on the flavor and fragrance profiles or biological 
activity) in these nature-identical molecules [17]. 




[Figure: typical volumes (T, inverse scale) plotted against typical
price ($/kg, from 1 to 10,000). Commodity chemicals: high volume, low
margin, requires cheap feedstock. High-value ingredients for food,
nutrition & personal care: low volume, high margin, less dependent on
cheap feedstock.]

Fig. 1 Relationship between market volumes and price of products


1.2 Economic Constraints and a Requirement for Underpinning Multidisciplinary Outlook


If you look at the cost structure of bioprocessing, there is a general 
inverse correlation between the market price and the market vol- 
ume (Fig. 1), but this also reflects cost structure in the production 
process. With bulk chemicals, volumes are high and market values 
are quite low, so efficient conversion is essential, and a real effort is 
needed to exploit the cheaper and abundant feedstocks. To be 
economically viable, high conversion efficiency close to the maxi- 
mum theoretical conversion yields is needed, and concentrations in 
the fermenter need to be high to facilitate cost-effective recovery. 
Energy and feedstock are probably the key costs, and this focuses 
the strain engineering on clearly identified phenomena controlling 
flux orientation into the desired synthetic pathway. 
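
The weight of conversion yield in bulk-chemical economics can be illustrated with simple arithmetic: the feedstock cost carried by each kilogram of product is the feedstock price divided by the yield. The sketch below is only an illustration; the price and the theoretical yield are hypothetical numbers chosen for the example and are not taken from this chapter.

```python
def feedstock_cost_per_kg_product(feedstock_price_per_kg: float,
                                  yield_kg_per_kg: float) -> float:
    """Feedstock cost carried by 1 kg of product (same currency as the price)."""
    if yield_kg_per_kg <= 0:
        raise ValueError("yield must be positive")
    return feedstock_price_per_kg / yield_kg_per_kg


# Hypothetical values, for illustration only.
SUGAR_PRICE = 0.40        # $/kg feedstock (assumed)
THEORETICAL_YIELD = 0.50  # kg product per kg feedstock (assumed)

for fraction in (0.5, 0.8, 0.95):
    actual_yield = fraction * THEORETICAL_YIELD
    cost = feedstock_cost_per_kg_product(SUGAR_PRICE, actual_yield)
    print(f"{fraction:.0%} of theoretical yield: "
          f"{cost:.2f} $/kg product from feedstock alone")
```

With these assumed numbers, a strain at half the theoretical yield carries 1.60 $/kg of feedstock cost, which already exceeds the selling price of many commodity chemicals before any energy or recovery cost is counted.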

When we look at high-price commodity chemicals, the situa- 
tion is rather different, and while these yield and concentration 
factors remain important, they are often no longer the dominant 
factor. The pathways are more complex and yields are often lower 
and the contribution of fermentation to the overall cost decreases 
with an increasing downstream processing cost. This is typified in 
some of the therapeutics in which the fermentation costs may be 
quite a low part of overall costs. The product has to meet purity 
criteria and formulation, and the downstream recovery is often the 
dominant cost. Because of this, the focus of the research is not so 
much to produce more but to produce cleaner such that separation 
costs can be diminished. It is often more important to remove a 
minor contaminant than to produce a few grams per liter more of 
the desired metabolite. 

In this context, using poorly defined complex but cheap feed- 
stocks, essential for bulk chemicals, would only have a marginal 
saving on the fermentation step but could have a disastrous conse- 
quence on downstream processing costs. This pleads for a global 
systemic view of the various limitations that have to be overcome,
established before the strain engineering begins, to fix the framework
which has to be respected in any strain that is generated [18]. Such a
view can only exist when the strain engineering is planned in close
collaboration with the process engineers who will have to translate
this to an industrial logic. Anything else reduces the process engineering
to damage limitation and slows down the transfer. In our view, this 
is one of the key difficulties which is encountered in many academic 
laboratories which tend to be focused either on the biology or on 
the process engineering. 

One of the consequences is that effort is spent in elaborate 
attempts to resolve via genetic engineering, phenomena which 
could be easily avoided by a better understanding of how the 
process could be designed to avoid the problem becoming mani- 
fest. This seems extremely logical, so why is this relatively rare in 
academic research structures while quite frequent in companies? 
This may simply reflect the way our academic departments have
been structured over many years with quite a deep divide between 
life science and engineering faculties, often situated quite some 
distance apart on the university campus and not always encouraged 
to work towards common goals. It also reflects what are considered 
to be academic success criteria as compared to what is vital for 
industry and the sequential nature in which biotechnology is 
often planned in which the strain development is usually close to 
completion before we hand this over with all its inherent strengths 
and weaknesses to the next link in the chain who then tries to make 
the best choices to optimize the strain performance in fermentation 
development before then looking at how we would recover the 
product. 

Since a lot of opportunities for innovation depend directly on
the preceding steps, the innovation space progressively diminishes 
and yet often needs extensive effort to find solutions compatible 
with the intrinsic weaknesses of the work that has gone beforehand. 
This pleads for a more open discussion to fix what are most proba- 
ble requirements downstream of the metabolic engineering and a 
reverse engineering logic in which the strains are designed from the 
onset to be compatible with the process most likely to be used. It is 
our view that adopting such a strategy would have a significant 
effect in accelerating the work as it moves up the TRL scale and 
avoids “back-to-the-drawing-board” situations when strain perfor- 
mance collapses when faced with the process environment. While 
the “Design-Build-Test-Learn” cycle is inherent to synthetic biol- 
ogy, common sense tells us that you would like to limit to a strict 
minimum the number of revolutions within this cycle in order to 
shorten development timelines. This is the underpinning logic 
(Fig. 2) of a pan-European consortium of laboratories (IBISBA) 
that brings together different disciplines in some of the top labora- 
tories to make available a consolidated platform to accelerate the 
penetration of synthetic biology into industrial biotechnology 
applications [19]. 




[Figure: Pan-European Research Infrastructure. Classical approach:
10 to 15-year development. Integrative approach: 4-6-year development.
Accelerating industrial biotechnology; enabling the bioeconomy.]

Fig. 2 An integrated multidisciplinary logic to accelerate development timelines from synthetic biology to
industrial biotechnology as proposed by the European IBISBA Network


1.3 Optimizing Complex Branched Secondary Metabolite Pathways


Many of the interesting pathways that lead to high-value speciality 
chemicals are pathways that are intrinsically complex due to the 
molecular structure of the product and frequently involve quite 
promiscuous enzymatic activities which lead to families of molecules
of similar structure but often very different biological activities.
Furthermore, their expression in natural hosts is often regulated by
complex and often obscure trigger stimuli, such that much of the
biodiversity present in nature is not expressed under classical
growth conditions. Today, transfer of
genes into a pre-optimized chassis organism can unlock part of 
this silent biodiversity [20]. 

This enzymatic promiscuity is often seen in plant-based essential
oils, which contain multiple compounds that contribute to the overall
fragrance of the oil and are intrinsically coupled to its value.
However, not all these compounds are safe when the oil
is used in cosmetics, and this is a cause for concern at the moment 
with REACH risk approval necessary for applications beyond a 
certain quantity in Europe. Simply transferring the pathway as it is 
to a new host might well intensify the production capacity and lead 
to better sustainability, but it does not necessarily modify the 
mixture, and this needs to be addressed to remove any undesirable 
co-products. Fortunately, protein engineering can modulate the
promiscuous nature of these enzymes and focus the pathway on the
desirable products, often changing the pathway towards the more
valuable products at the same time [21].

Expressing such secondary metabolite pathways does of course 
lead to increasing complexity in attaining high yields. The obliga- 
tory pathway flux optimization becomes increasingly difficult as the 
pathway involves more and more reactions, following a geometric 
progression in combinatory expression profiles. Early progress in 
metabolic engineering often involved primary metabolites which 
were closely linked to central metabolism in many cases such that 
modifying the network did not involve a significant protein burden 
and introduction of very few novel genes. It came with its own 
problems as yields were high, and so biomass synthesis was mod- 
ified due predominantly to the fact that carbon and energy fluxes 
were being deviated away from anabolism to non-growth-related 
end products. This could be dealt with quite effectively by decou- 
pling growth from metabolite accumulation phase using fed-batch- 
type strategies. 

These same strategies are often employed to induce the more 
complex, secondary metabolism-derived biosynthetic pathways, 
but consequences for cell fitness are accentuated by the overloading 
of the protein synthetic machinery. Indeed, the logic behind 
growth decay during production is somewhat different, as in most
cases it is not due to modified carbon flux throughout the metabolic
network but to the huge requirement for synthesis of high
concentrations of enzymes not required for growth. This is compounded
by the often intrinsically low enzymatic activity of these pathway
enzymes, which have evolved naturally to produce rather small amounts
under very precise environmental conditions. Currently we use
sledgehammer tactics to overexpress these activities to
overcome such limitations and indeed attain performance levels 
much higher than in natural producers. With pathways with typi- 
cally 12-20 reactions at minimum, such overexpression monopo- 
lizes a significant part of the ribosomal protein synthesis machinery. 

The ribosomes can be considered like any catalyst in biological
systems: multiple substrates (mRNA transcripts) with different
affinities (RBS sequences), together with ribosomal loading onto the
mRNA, determine the rate of protein synthesis and hence the
intracellular concentration of each protein. All mRNA
species compete for access to ribosomes, and ultimately the global 
protein output reflects this open system. Sudden and massive 
induction of novel proteins which contribute a significant part of 
the overall mRNA population has inevitable consequences on the 
synthesis of cellular proteins and will provoke a diminished capacity 
of the cells to develop and multiply. 
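
The competition for ribosomes described above can be sketched numerically. The following is a deliberately crude illustration and not a model taken from this chapter: it assumes a fixed pool of ribosomes that is shared in proportion to transcript abundance multiplied by a relative RBS strength. The function name, transcript names, and all numbers are hypothetical.

```python
def ribosome_shares(transcripts: dict) -> dict:
    """Share of translation capacity captured by each mRNA species.

    `transcripts` maps a name to (abundance, relative_rbs_strength).
    Assumption: capacity is split in proportion to abundance x RBS strength.
    """
    weights = {name: abundance * rbs
               for name, (abundance, rbs) in transcripts.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical transcript pools (arbitrary units).
native = {"growth_proteins": (1000.0, 1.0)}
induced = {**native, "heterologous_pathway": (500.0, 2.0)}

before = ribosome_shares(native)["growth_proteins"]
after = ribosome_shares(induced)["growth_proteins"]
print(f"share of ribosomes translating growth proteins: "
      f"{before:.2f} before induction, {after:.2f} after")
```

In this toy case the induced transcripts are half as abundant as the native pool but carry an RBS twice as strong, so they capture half of the translation capacity and the synthesis rate of growth-related proteins is halved.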




In other words, the expression of a non-growth-associated 
pathway will automatically slow down the rate at which growth 
essential proteins can be synthesized. Of course this depends also 
on the mRNA decay rates which also contribute to the active 
cellular concentrations and which can now to some extent be 
engineered [22]. One might also consider genome streamlining 
as a means to remove some of the protein synthesis stress associated 
with this protein burden effect. If there are fewer proteins being
synthesized, then ribosomal efficiency for the essential proteins 
might be improved. Many proteins are present in cells as part of 
an adaptive response, and some of these have little value in a 
cultivation system in which specific growth conditions can be
maintained. Estimates of the number of non-essential genes are very
variable and depend on the criteria used to make the identification.
They often take a genome viewpoint when, for this particular
application, one would need to assess the usefulness of the
genes actually transcribed under the chosen growth condition. We
will come back to some of these aspects when discussing genetic 
stability later in the chapter, but what are our options to alleviate 
this phenomenon? 

Rather than treating the consequence of the loss of fitness, it 
might be better to attack the cause of the problem: the naturally 
low kcat values of these enzymes. There is tremendous scope for
redesigning these enzymes so that pathway flux can be maintained 
without such drastic overexpression. The problem is not simple as 
natural evolution has favored low activity for many secondary 
metabolite pathway reactions, and this will certainly call on the 
automated exploration of synthetic biodiversity coupled to AI 
technologies. However, gains in specific activity would remove
some of these protein burden situations and the associated metabolic
stress, and would be game-changers for such speciality chemical
conversion efficiencies.
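
The link between catalytic constant and protein burden follows from the saturated rate law v = kcat x [E]: for a fixed pathway flux, the enzyme concentration required is inversely proportional to kcat. The short sketch below illustrates this under the assumption of substrate saturation; the flux target and kcat values are hypothetical.

```python
def enzyme_needed(flux: float, kcat: float) -> float:
    """Enzyme concentration required to carry `flux` at saturation.

    Uses v = kcat * [E]; with flux in umol/L/s and kcat in 1/s,
    the result is in umol/L.
    """
    if kcat <= 0:
        raise ValueError("kcat must be positive")
    return flux / kcat


TARGET_FLUX = 10.0  # umol/L/s, hypothetical pathway flux

for kcat in (0.1, 1.0, 10.0):  # 1/s, hypothetical catalytic constants
    print(f"kcat = {kcat:>4} 1/s -> "
          f"{enzyme_needed(TARGET_FLUX, kcat):.0f} umol/L enzyme required")
```

Each tenfold gain in kcat cuts the enzyme requirement tenfold for the same flux, which is why improving specific activity attacks the cause of the burden rather than its consequences.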

Beyond this rupture with the way we have been generally 
looking at how to engineer organisms, we also have a very prag- 
matic problem on how to optimize such pathways as the combina- 
tory options that need to be explored are huge if we do this reaction 
by reaction. In this respect, prokaryotes with their operon-based 
coding offer some advantages [23]. Let’s take a classical pathway 
with, say, 14 enzymes involved. If you look at just four different
promoters for each reaction, you have 4¹⁴ different constructs, or
about 2.7 × 10⁸ strains, to construct and assess. If you segregate this
pathway into, say, 4 operons, you can explore the same experimental
space in a 4⁴ matrix, so only 256 constructs [24]. Further
optimization can be achieved rapidly in such a modular logic by doing
intra-operonic balancing using defined RBS sequences in what has
come to be known as multidimensional heuristic optimization [25],
which can rapidly establish a heat map-type logic of the extent to
which the pathway can be optimized (Fig. 3) without having to
have biofoundry-type facilities available.
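
The size of the two design spaces in the 14-enzyme example can be checked directly. The sketch below counts the constructs for gene-by-gene and operon-level promoter assignment and enumerates the operon-level designs; the promoter labels P1 to P4 are hypothetical placeholders.

```python
from itertools import product


def n_constructs(n_promoters: int, n_units: int) -> int:
    """Number of designs when each unit independently gets one promoter."""
    return n_promoters ** n_units


per_gene = n_constructs(4, 14)    # one promoter choice per enzyme
per_operon = n_constructs(4, 4)   # one promoter choice per operon

# Enumerate every operon-level design explicitly (hypothetical labels).
promoters = ("P1", "P2", "P3", "P4")
designs = list(product(promoters, repeat=4))

print(f"gene-by-gene: {per_gene:,} constructs")
print(f"operon-level: {per_operon} constructs ({len(designs)} enumerated)")
print(f"reduction factor: {per_gene // per_operon:,}")
```

Grouping the pathway into four modules therefore shrinks the screening effort by a factor of 4¹⁰, at the price of leaving the balance within each operon to a later round of RBS tuning.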




[Figure: (a) global optimization of inter-module expressions
(transcription, promoter) and local optimization of intra-module
expressions (translation); (b) heat diagram with a scale from 100 to 300.]

Fig. 3 Multidimensional optimization logic using a modular approach (a) and a typical pathway flux heat
diagram showing best option construction space (b)


As pathway flux gets closer to a biochemical network maxi- 
mum, or co-factor availability becomes a bottleneck, optimization 
needs to be extended to include updates to the central metabolic 
pathways, but this is a complex challenge due to the highly regu- 
lated metabolic topology of this network. Quite considerable prog- 
ress can still be made in the speciality chemicals domain before 
having to engage in this additional level of complexity. However, 
these molecules are often challenging because they are associated
with a certain toxicity for the producer strain. Maybe we need to
concentrate more on engineering efflux systems to remove these
compounds efficiently from the cells, and then employ multi-phasic
fermentation technologies to remove the compound directly from the
aqueous phase, thereby overcoming the toxic effects which otherwise
accentuate the inevitable loss of fitness in such engineered cells.

So far we have been considering optimizing what are predominantly
natural pathways re-engineered to boost performance but
still suffering from the overriding constraint in many biological
systems: metabolic networks have evolved as a whole with a view to
favoring growth, first and foremost rapid growth, while remaining able
to switch to efficient growth if substrate limitations occur. One of
the guiding principles in natural evolution is therefore to enable 
pathways to be regulated so as to maintain optimal homeostasis 
which in turn ensures best metabolic efficiency relating to growth 
and fitness. Of course, when looking at biotechnology applications, 
the first thing we have to do is bypass the complex regulatory 
phenomena that maintain a balanced supply of all anabolic precur- 
sors while often we retain the basic reaction sequence that nature 
has evolved. 

If we shift into a different logic, we might ask if an organic 
chemist would derive the same reaction sequence and whether their 
pathway would be compatible with a specific growth regime that we 
as biotechnologists ought to be able to control so as to modify the 
fitness criteria and enable those pathways with best thermodynamic 
efficiency to be proposed rather than those in which the complexity
has been selected to enable better control of a constrained
requirement for that pathway [26] and more generally to maintain
growth-related homeostasis. This would often require engineering
of the enzymes to adapt them to novel reactions and assembling
such reactions in a novel manner, but it opens up new possibilities
to overcome key pathway limitations which would be otherwise
difficult to resolve.


1.4 Genetic Stability


As the metabolic engineering challenge increases to redesign and 
express more complex pathways, the question of genetic stability 
will become increasingly important, together with a more detailed
knowledge of how the choice of genomic integration site can
modulate the efficiency of gene expression. As we
have seen, engineering new functions into cells creates a loss of 
fitness, accentuated by the reinforced process-induced stress asso- 
ciated with poorly mixed large-scale fermenters and what is termed 
metabolic burden. 

The instability of production strains is the result of several 
environmental constraints intrinsic to the production process but 
also those intrinsic to the organism (metabolic cost, toxicity, DNA 
repair mechanisms, etc.). One of the consequences of this is a
tendency for cells to try and remove the cause of this burden, namely
the novel genetic elements introduced into the genome; the various
factors involved are detailed in a recent paper [27]. In many cases,
we are still reliant on multi-copy plasmid technology, with its
intrinsic plasmid loss that is limited using either antibiotic
resistance (unacceptable in industrial fermentations for many
applications) or auxotrophic complementation. The problem
concerning most of these systems is that while they do offer some
protection as long as the selective pressure is maintained, the full 
force of this protection is concentrated on the loss of the last copy, 
so expression profiles can change throughout the production 
phase. This becomes more acute when the industrialization tries 
to prolong the period of production and most acute when shifting 
to continuous or semi-continuous production modes. 
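
Why selection acts only on the last copy can be seen in an idealized partitioning model. The sketch below assumes that each plasmid copy present at division is assigned to one of the two daughters independently with probability 1/2; it ignores copy-number control, active partitioning systems, and multimer formation, so it is an illustration rather than a description of any real plasmid. The function names are hypothetical.

```python
from math import comb


def daughter_copy_distribution(copies: int) -> list[float]:
    """P(a daughter inherits k copies), k = 0..copies, for random partitioning."""
    return [comb(copies, k) / 2 ** copies for k in range(copies + 1)]


def p_plasmid_free(copies: int) -> float:
    """Probability that a given daughter inherits no plasmid at all."""
    return daughter_copy_distribution(copies)[0]


for n in (2, 10, 40):
    dist = daughter_copy_distribution(n)
    low_copy = sum(dist[: n // 4 + 1])  # daughters with at most n/4 copies
    print(f"{n:>2} copies at division: P(plasmid-free) = {p_plasmid_free(n):.2e}, "
          f"P(at most {n // 4} copies) = {low_copy:.2e}")
```

Antibiotic or auxotrophic selection removes only the plasmid-free class (k = 0). Daughters that inherit few copies survive, so gene dosage and hence expression can vary across the population even when selection is maintained.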

The classical response to this lack of plasmid stability is to 
integrate the novel pathway into the genome, and when envisaging 
this strategy, the question that begs to be answered is “can I favor 
genetic stability and best expression of the genes introduced by 
choosing where in the genome I make the integration?”. Today one 
of the challenges which interfaces metabolic engineering with sys- 
tems biology is to better understand how genome fine structure can 
be exploited to limit genetic instability. Of course genetic instability 
often leads to yield loss but also provokes unpredictability of
production and knock-on effects for product recovery.

Quite often a production organism has multiple copies of the
required genes present in the genome, and unless these are designed
so as to reduce homology using the full scope of alternative codons,
recombination is always likely to occur. Likewise, the presence of
transposable elements, frequent in many microbial genomes, will tend
to facilitate genome editing as a survival response to stress conditions.
Removing such sequences from the production host has been
shown to increase genome stability in strains of Escherichia coli
engineered for 1,4-butanediol production [28], and no doubt this
approach needs to be extended to all chassis organisms envisaged
for industrial production. In a wider logic, synthetic biology
projects to create synthetic chromosomes can remove all such
mechanisms favoring genetic instability, as has been demonstrated for
the yeast genome project in which transposon elements were
removed [29]. Maintaining introduced pathways has a metabolic
cost to the cell and will inevitably lead to attempts to remove such
genes. Strain monitoring, preventive measures to attenuate this
probability, and a choice of fermentation conditions which
uncouple production from growth would certainly help limit the
consequences, but can we construct our strains in such a way that
best possible stability can be ensured?


1.5 The Concept of Genomic Safe Harbors


The problem of transgene instability is a well-known phenomenon 
in the pharmaceutical industry, where production cell lines are 
mainly generated by random transgene integration. As a result, 
these cell lines often need to be discarded during the cultivation 
phase due to a progressive loss of productivity, which may be the 
consequence of chromosomal rearrangements and/or of transcrip- 
tional repression by methylation. To circumvent these issues, sev- 
eral groups have pointed out the necessity of integrating transgenes 
at specific loci, which should allow safe and stable expression over 
time. The notion of “safe harbor” appeared at the very beginning of 
this century with the first gene therapy projects aiming to introduce 
a copy of a functional gene into the cells of patients with a 
defective gene. 

The success of gene therapy requires stable expression of the
introduced gene without deleterious impact on the organism.
Thus, the introduction of the gene in these safe loci is the sine qua 
non of a gene therapy guaranteeing maximum safety and efficacy. 
These integration sites called genomic safe harbors (GSHs) are 
defined as chromosomal locations where a transgene can integrate 
and function in a predictable manner without disrupting the activ- 
ity of endogenous genes and altering the viability of the organism 
[30]. Most GSHs were identified after random integration of a
lentivirus carrying a cassette containing either a promoter-less
reporter gene (encoding antibiotic resistance, green fluorescent
protein, or β-galactosidase) for a promoter-trapping approach, or a
full expression system, followed by phenotypic screening. These
approaches were successfully used in human cells [31, 32], embry- 
onic mouse cells [33], and CHO cells [34] to identify sites for 
expression of these transgenes. 


Furthermore, the lack of proven pathology after modification
at these loci has subsequently led to considerable interest in gene
therapy, though their location in gene-rich regions may pose a
risk of deregulation of adjacent genes [35]. As such, they do not fit
the selection criteria initially proposed, i.e., (i) be at least 50 kb
from the 5’ end of a gene, (ii) be at least 300 kb from oncogenes,
(iii) be at least 300 kb from genes coding for microRNAs, (iv) be
outside a transcription unit of a gene, and (v) be outside
ultra-conserved regions. These criteria are intended to limit the risk
of disruption of endogenous genes as well as long-distance
interactions between vector-encoded transcriptional activators and
adjacent genes [30]. Research continues to find the best safe harbor
sites in various genomes, but this very targeted logic with specific
objectives is progressively becoming a gold standard for all
metabolic engineering projects, which need not only to optimize
the stability of the genetic information added to the production
host but also to minimize the direct consequences of modifying
expression of adjacent genes.


1.6 Identification of Integration Loci in Microbial Cell Factories


The identification of integration loci in microorganisms, especially
yeasts, emerged in the 2010s with the rise of synthetic biology. As
we have seen, generating production organisms able to synthesize
complex molecules requires addition of quite large numbers of
genes, and there is a real requirement to ensure stable and predictable
transgene expression without affecting neighboring genes. One of the
first studies involved introducing the LacZ gene into 20 different
integration sites in the Saccharomyces cerevisiae genome and measuring
β-galactosidase activity [36]. The study revealed up to eightfold
differences in activity depending on the integration locus and showed
that regions near telomeres were less favorable for expression and that
regions near replicating sequences were more favorable. Later work
validated 11 individual integration loci located in the intergenic
regions of S. cerevisiae chromosomes X, XI, and XII for their ability
to ensure high transgene expression [37]. In addition, the sites were
separated by essential genes, which prevents loss of integrated
fragments through recombination and ensures the stability of the strain.
Finally, these sites were all located in intergenic regions of at
least 750 bp to reduce the impact on neighboring genes. Once such loci
had been validated, the authors illustrated their full potential for
metabolic engineering by introducing the seven-step indolylglucosinolate
pathway of Arabidopsis thaliana, a multigene pathway with up to 22 genes.


1.7 Genome Editing, a Powerful Tool to Target Desired Sites


The development of cheap and easy-to-use genome editing tech- 
nologies has increased our ability to manipulate the genetic makeup 
of cells and microbes. This technology allows (i) the insertion of 
DNA fragments into targeted genomic locations; (ii) the deletion 
of small and large DNA fragments; or (iii) the introduction of point 
mutations. The development of genome editing kits including
nucleases, guide RNAs, and recombination templates has greatly
accelerated the identification and validation of integration loci in
S. cerevisiae but also in other yeasts where homologous
recombination is less efficient. Thus, Nielsen’s group developed the
EasyClone-MarkerFree vector toolkit [38], allowing stable
integration in both laboratory and industrial strains of S. cerevisiae and
high gene expression at these 11 individual sites [39, 40]. More
recently, the same group developed an expansion of the
EasyClone-MarkerFree toolkit for the S. cerevisiae genome with eight
new integration sites [41]. The challenge now is to deploy these
genetic tools in polyploid industrial strains. Recent work has
demonstrated the power of the HI-CRISPR genome editing tool to
disrupt four genes in diploid and triploid yeast [42]. The next step
will be to introduce multiple copies of metabolic pathways into
polyploid industrial strains.

In Pichia pastoris or Kluyveromyces species, which, unlike
S. cerevisiae, have a low frequency of spontaneous homologous
recombination, the introduction of transgenes at a predefined
locus is possible using the CRISPR/Cas9 system and is clearly
favored when key genes involved in NHEJ are deleted [43-45], which
has allowed the validation of integration loci. Furthermore, there is
a strong interest in finding safe harbors in non-conventional
organisms of biotechnological interest such as the microalga
Phaeodactylum tricornutum [46] and the oleaginous yeast Yarrowia
lipolytica [47].


1.8 Creation of Landing Pads

Once safe harbor loci are validated, it is easy to create “landing 
pads,” i.e., sites in which transgenes can be routinely inserted for 
stable and reliable expression with predictable homologous
recombination frequency [34]. These landing pads generally contain a
recombination site and a selection marker. They are useful
tools for creating multi-copy, site-specific integration platforms.
For example, multi-copy integration (18 and 25 copies per genome) of
the 2,3-butanediol biosynthesis pathway has been reported [48, 49].
Recently, several studies have
reported the development of artificial chromosomes as a tool to 
easily and efficiently assemble the genes and chromosomal elements 
necessary to control the expression of metabolic pathways. These 
systems have the advantage of circumventing the unpredictable 
impact of the chromosomal environment (chromatin accessibility, 
methylation, microsatellite sequences, etc.) on the site-specific inte- 
gration frequency. 

For example, Yarrowia lipolytica artificial chromosomes (ylAC)
were used to assemble and express a large metabolic pathway 
including three key genes for xylose utilization (XYL1, XYL2, and 
XKS1) and three for cellobiose consumption (CBP1, CDT1, and 
scPGM2) [50]. In addition, a study showed the development of a 
supernumerary neochromosome for rational engineering of the 
yeast genome. The ability of synthetic supernumerary chromosomes
to serve as a landing platform for the integration of native
and heterologous metabolic pathways has been demonstrated.
However, it was noted that neochromosome expression of an
essential pathway reduced the host-specific growth rate by
approximately 14-24% [51]. These new genetic tools are useful for
expressing a large number of metabolic pathways; demonstrating the
stability of these systems in bioreactors will allow them to be
validated for industrial applications.

Identification of safe harbors has been a long and tedious
process, but it is absolutely necessary to ensure constant and
predictable production of target molecules. It is mostly a
trial-and-error approach performed in four steps: random integration of a
transgene, identification of the integration sites, evaluation of the
expression level, and evaluation of the consequences at the cell
level (transcriptome, growth, etc.). Recently, bioinformatics
pipelines have emerged based on a rational approach whose criteria
match the established GSH criteria [52]. Although these
pipelines are currently developed for mammalian cells, it is likely
that they will facilitate similar toolbox development for use in
microbial cell factories of biotechnological interest (yeast, bacteria,
algae, etc.) with rational criteria based on gene density, chromatin
accessibility, presence of transposable elements, etc.


1.9 Creating Fermentation-Friendly Chassis Organisms

One of the common problems encountered in translating 
promising newly engineered microbes with great performance 
under laboratory conditions is that they sometimes struggle to 
maintain the same performance when scaled to full industrial pro- 
duction. This reflects the additional metabolic stress (see above) 
which inevitably occurs when laboratory-scale fermentation is 
shifted to full industrial-scale fermenters which often function 
close to the absolute limits of mass transfer and inevitably provoke 
some degree of environmental heterogeneity. This is frequently 
overlooked during the strain development strategy and is inherent 
to the sequential pattern of process evolution in which the biopro- 
cess engineering aspects are not usually examined until the first 
generation of high-performance hosts is ready. While it would 
make a lot more sense to get the process engineering constraints 
identified and included in the initial project plan from the outset,
this is not common, at least in academic projects. 

Knowing what can easily be achieved by processing technolo- 
gies and what constraints this might impose on the microbe is 
essential information which could be used to plan how best to 
combine the advantages that strain engineering can bring with 
those solutions which can be resolved by the fermentation and 
downstream stages of the process. It is also vital to take into account 
the physical limitations which are associated with scaling the
process. As fermentation volumes increase, mixing and mass
transfer limitations appear due to the geometry of the reactor. As
most production cycles are based on a fed-batch logic with high
biomass concentrations to attain the volumetric productivity
needed, transferring this type of performance from classical
laboratory apparatus, which normally has good transfer dynamics,
to a larger-scale fermenter is a real problem.

We tend to engineer high-performance strains which require
very tight control over fermenter conditions, and such control is
difficult to attain at large scale. More effort is needed to design
or evolve chassis organisms that show a robust phenotype when
subjected to the inevitable transient stress conditions that arise
from the lack of reactor homogeneity, in which spatiotemporal
variations occur in key factors such as nutrient availability, pH,
dissolved oxygen, and transient co-product accumulation. These
variations impose low-intensity but high-frequency stress on the
microbes; investing effort in phenotypic robustness would
facilitate scaling of the actual production organisms. Today,
computational fluid dynamics can model the mixing characteristics
of any fermenter and increasingly incorporates biological kinetics
[53]. Such models not only predict the actual distribution of
conditions seen throughout a fermentation, indicating where strain
engineering might have to address fitness characteristics, but also
help design small-scale reactors which approximate the constraints
found in full-scale fermentations. These constraints can easily
diminish the intensified flux patterns seen at lab scale and
significantly reduce yields.
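
A rough way to see why gradients appear at scale is to compare two characteristic times: the time needed to mix the vessel and the time the cells need to consume the substrate around them. The sketch below is a back-of-the-envelope calculation, not a substitute for the computational fluid dynamics models cited above. The comparison rule, the function names, and all numerical values (substrate level, biomass concentration, uptake rate, and mixing times) are assumptions chosen for illustration.

```python
def substrate_consumption_time_s(substrate_g_per_l, biomass_g_per_l,
                                 max_uptake_g_per_g_h):
    """Time (s) for the cells to exhaust the local substrate at maximal uptake."""
    uptake_g_per_l_h = biomass_g_per_l * max_uptake_g_per_g_h
    return substrate_g_per_l / uptake_g_per_l_h * 3600.0


def gradients_expected(mixing_time_s, consumption_time_s):
    """Heuristic: gradients form when mixing is slower than consumption."""
    return mixing_time_s > consumption_time_s


# Assumed fed-batch conditions: 0.05 g/L residual substrate, 50 g/L biomass,
# and a maximal specific uptake rate of 1.0 g substrate per g biomass per hour.
t_cons = substrate_consumption_time_s(0.05, 50.0, 1.0)

# Assumed mixing times: about 2 s at bench scale, about 100 s at industrial scale.
lab_gradients = gradients_expected(2.0, t_cons)
industrial_gradients = gradients_expected(100.0, t_cons)

print(f"consumption time: {t_cons:.1f} s")
print(f"gradients expected at lab scale: {lab_gradients}")
print(f"gradients expected at industrial scale: {industrial_gradients}")
```

Under these assumed values the substrate is consumed in a few seconds, so a vessel that takes far longer to mix would expose cells to alternating feast and famine zones, which is the kind of high-frequency stress described above.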

Bioreactor design is notoriously conservative. It has not
evolved to any great extent since the initial investments in
industrial biotechnology and in many cases is unlikely to change
radically, though clearly there is scope to improve mixing within
large-scale reactors and help offset this dilemma. One major shift,
which mirrors the way batch chemical synthesis is beginning to move
towards continuous flow chemistry, could help simplify the
challenges of industrial biotechnology. A relatively small number of
processes have been developed with a continuous culture system,
which stabilizes a specific pseudo-stationary production
environment. Such an environment would greatly facilitate the
optimization of strains adapted to stable conditions and able to
tune their performance to this constant environment. Currently our
carefully designed microbial factories, exploited in batch or
fed-batch conditions, have to come to terms with an environment in
which the principal constraints change dynamically, and indeed
much of the production phase will not be under the conditions for
which pathway flux has been optimized. Fixing a stable environment
means that you can optimize for a given condition, though
obviously the genetic stability issues discussed above become even
more important. This cultivation method has shown itself to be a
key factor in understanding microbial physiology and would be a
stable basis for any manipulation of that same physiology. A
simplified system is always simpler to engineer.
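
The appeal of a fixed environment can be illustrated with the textbook chemostat model, in which the dilution rate sets the growth rate and therefore the residual substrate concentration the cells experience. The sketch below assumes Monod growth kinetics and a constant biomass yield. Neither assumption comes from this chapter, and the parameter values are hypothetical.

```python
def chemostat_steady_state(dilution_rate, mu_max, k_s, yield_xs, s_feed):
    """Steady state of an ideal chemostat with Monod kinetics.

    Returns (residual substrate, biomass) in the same concentration units
    as k_s and s_feed. Rates are per hour.
    """
    if not 0.0 < dilution_rate < mu_max:
        raise ValueError("dilution rate must lie between 0 and mu_max")
    # At steady state the specific growth rate equals the dilution rate,
    # so mu_max * S / (k_s + S) = D can be solved for S.
    residual = k_s * dilution_rate / (mu_max - dilution_rate)
    if residual >= s_feed:
        raise ValueError("washout: the feed cannot sustain this dilution rate")
    biomass = yield_xs * (s_feed - residual)
    return residual, biomass


# Hypothetical parameters: mu_max 0.4 1/h, K_s 0.1 g/L, yield 0.5 g/g, feed 10 g/L.
residual, biomass = chemostat_steady_state(
    dilution_rate=0.2, mu_max=0.4, k_s=0.1, yield_xs=0.5, s_feed=10.0
)
print(f"residual substrate: {residual:.2f} g/L, biomass: {biomass:.2f} g/L")
```

In this idealized picture, choosing the dilution rate fixes the condition a strain has to be optimized for, which is the simplification argued for above.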


18 Fayza Daboussi and Nic D. Lindley 


2 Conclusions 


Metabolic engineering has an increasingly sophisticated toolbox 
available, and many of the problems that were difficult to resolve 
20 or more years ago are becoming feasible in consolidated strate- 
gies which can identify upfront where the big challenges are going 
to be. Better integration of the entire value chain from the outset
would speed up the transfer of promising innovation into validated
processes and avoid some of the bottlenecks that have slowed down
industrial exploitation of some aspects of microbial cell factories to
date. Making greater use of mathematical modeling and the in silico
testbed, so as to focus the experimental input on the most probable
solutions, is the essential glue that can fuse the efforts in
systems biology and the application potential of metabolic
engineering. Better understanding of how microbes function in
realistic industrial conditions will enable robustness to be built
into the DNA of our high-performance cell factories and create a
rapid development pipeline to boost success rates when transferring
laboratory studies to industry. Today the framework exists and
needs to be more widely employed.


References

1. Voigt CA (2020) Synthetic biology 2020-2030: six commercially available products that are changing the world. Nat Commun 11:6739-6745
2. Bailey JE (1991) Towards a science of metabolic engineering. Science 252(5013):1668-1675
3. Stephanopoulos G, Vallino JJ (1991) Network rigidity and metabolic engineering in metabolite overproduction. Science 252(5013):1675-1681
4. Stephanopoulos G (2002) Metabolic engineering by genome shuffling. Nat Biotechnol 20:665-668
5. Chao R, Misra S, Si T et al (2017) Engineering biological systems using automated biofoundries. Metab Eng 42:98-108
6. Nielsen J (2017) Systems biology of metabolism. Annu Rev Biochem 86:245-275
7. Choi S, Song CW, Shin JH et al (2015) Biorefineries for the production of top building block chemicals and their derivatives. Metab Eng 28:223-239
8. Tsage Y, Kimisaguchi H, Susaki K et al (2016) Engineering cell factories for producing building block chemicals for biopolymer synthesis. Microb Cell Factories 15:19
9. Wendish VF (2014) Microbial production of amino acids and derived chemicals: synthetic biology approaches to strain development. Curr Opin Biotechnol 30:51-58
10. Products made from petroleum. Https://www.ranken-energy.com/index.php/products-made-from petroleum/
11. Dietrich JA, McKee AE, Keasling JD (2010) High throughput metabolic engineering: advances in small molecule screening and selection. Annu Rev Biochem 79:563-565
12. Volk MJ, Lourentzou I, Misra S et al (2020) Biosystems design by machine-learning. ACS Synth Biol 9(7):1514-1533
13. Carbonell P, Radivojevic T, Martin HG (2019) Opportunities at the intersection of synthetic biology, machine learning and automation. ACS Synth Biol 8(7):1474-1477
14. Wu G, Yan Q, Jones JA et al (2016) Metabolic burden: cornerstones in synthetic biology and metabolic engineering platforms. Trends Biotechnol 34(8):652-644
15. Raveendran S, Parameswaran B, Ummalyma SB et al (2018) Applications of microbial enzymes in food industry. Food Technol Biotechnol 56(1):16-30
16. Rudge SR, Ladisch MR (2020) Industrial challenges of recombinant proteins. Adv Biochem Eng Biotechnol 171:1-22
17. Chen X, Zhang C, Lindley ND (2019) Metabolic engineering strategies for sustainable terpenoid flavour and fragrance synthesis. J Agric Food Chem 68(38):10252-10264
18. Lee S, Kim HU (2015) Systems strategies for developing industrial microbial strains. Nat Biotechnol 33:1061-1072
19. Sharing the way to innovation. https://www.ibisba.eu
20. Zhang C, Chen X, Orban A et al (2020) Agrocybe aegerita serves as a gateway for identifying sesquiterpene biosynthetic enzymes in higher fungi. ACS Chem Biol 15:1268-1277
21. Shukal S, Chen X, Zhang C (2019) Systematic engineering for high-yield production of viridiflorol and amorphadiene in auxotrophic Escherichia coli. Metab Eng 55:170-178
22. Nouaille S, Mondeil S, Finoux AL et al (2017) The stability of an mRNA is influenced by its concentration: a potential physical mechanism to regulate gene expression. Nucleic Acids Res 45(20):11711-11724
23. Smanski MJ, Bhatia S, Zhao D et al (2014) Functional optimisation of gene clusters by combinatorial design. Nat Biotechnol 32(12):1241-1249
24. Zhang C, Chen X, Lindley ND et al (2018) A plug-n-play modular metabolic system for the production of apocarotenoids. Biotechnol Bioeng 115:174-183
25. Zhang C, Seow VY, Chen X et al (2018) Multidimensional heuristic process for high-yield production of astaxanthin and fragrance molecules in Escherichia coli. Nat Commun 9:1858-1870
26. Walther T, Calvayrac F, Malbert Y et al (2018) Construction of a synthetic pathway for the production of 2,4-dihydroxybutyric acid from homoserine. Metab Eng 45:237-245
27. Rugbjerg P, Sommer MOA (2019) Overcoming genetic heterogeneity in industrial fermentations. Nat Biotechnol 37:869-876
28. Burgard A, Burk MJ, Osterhout R et al (2016) Development of a commercial scale process for production of 1,4-butanediol from sugar. Curr Opin Biotechnol 42:118-125
29. Richardson SM, Mitchell LA, Stracquadanio G et al (2017) Design of a synthetic yeast genome. Science 355(6329):1040-1044
30. Sadelain M, Papapetrou E, Bushman F (2012) Safe harbours for the integration of new DNA in the human genome. Nat Rev Cancer 12:51-58
31. Papapetrou EP, Lee G, Malani N et al (2011) Genomic safe harbors permit high β-globin transgene expression in thalassemia induced pluripotent stem cells. Nat Biotechnol 29:73-78
32. Papapetrou EP, Schambach A (2016) Gene insertion into genomic safe harbors for human gene therapy. Mol Ther 24:678-684
33. Friedrich G, Soriano P (1991) Promoter traps in embryonic stem cells: a genetic screen to identify and mutate developmental genes in mice. Genes Dev 5(9):1513-1523
34. Gaidukov L, Wroblewska L, Teague Nelson BT et al (2018) Multi-landing pad DNA integration platform for mammalian cell engineering. Nucleic Acids Res 46:4072-4086
35. Pavani G, Amendola M (2021) Targeted gene delivery: where to land. Front Genome Ed 2:36
36. Bai Flagfeldt D, Siewers V, Huang L et al (2009) Characterization of chromosomal integration sites for heterologous gene expression in Saccharomyces cerevisiae. Yeast 26(10):545-551
37. Mikkelsen MD, Buron LD, Salomonsen B et al (2012) Microbial production of indolylglucosinolate through engineering of a multi-gene pathway in a versatile yeast expression platform. Metab Eng 14(2):104-111
38. Jessop-Fabre MM, Jakociunas T, Stovicek V et al (2016) EasyClone-MarkerFree: a vector toolkit for marker-less integration of genes into Saccharomyces cerevisiae via CRISPR-Cas9. Biotechnol J 11(8):1110-1117
39. Jensen NB, Strucko T, Kildegaard KR et al (2014) EasyClone: method for iterative chromosomal integration of multiple genes Saccharomyces cerevisiae. FEMS Yeast Res 14(2):238-248
40. Stovicek V, Borodina I, Forster J (2015) CRISPR-Cas system enables fast and simple genome editing of industrial Saccharomyces cerevisiae strains. Metab Eng Commun 2:13-22
41. Babaei M, Sartori L, Karpukhin A et al (2021) Expansion of EasyClone-MarkerFree toolkit for Saccharomyces cerevisiae genome with new integration sites. FEMS Yeast Res 21(4):foab027
42. Lian J, Bao Z, Hu S et al (2018) Engineered CRISPR/Cas9 system for multiplex genome engineering of polyploid industrial yeast strains. Biotechnol Bioeng 115(6):1630-1635
43. Weninger A, Fischer JE, Raschmanova H et al (2018) Expanding the CRISPR/Cas9 toolkit for Pichia pastoris with efficient donor integration and alternative resistance markers. J Cell Biochem 119(4):3183-3198
44. Gao J, Gao N, Zhai X et al (2021) Recombination machinery engineering for precise genome editing in methylotrophic yeast Ogataea polymorpha. iScience 24(3):102168
45. Rajkumar AS, Varela JA, Juergens H et al (2019) Biological parts for Kluyveromyces marxianus synthetic biology. Front Bioeng Biotechnol 7:97
46. Defrel G, Marsaud N, Rifa E et al (2021) Identification of loci enabling high-level heterologous gene expression in the diatom Phaeodactylum tricornutum. Front Bioeng Biotechnol 864
47. Larroude M, Park YK, Soudier P et al (2019) A modular Golden Gate toolkit for Yarrowia lipolytica synthetic biology. Microb Biotechnol 12(6):1249-1259
48. Shi S, Liang Y, Zhang MM et al (2016) A highly efficient single-step, markerless strategy for multi-copy chromosomal integration of large biochemical pathways in Saccharomyces cerevisiae. Metab Eng 33:19-27
49. Huang S, Geng A (2020) High-copy genome integration of 2,3-butanediol biosynthesis pathway in Saccharomyces cerevisiae via in vivo DNA assembly and replicative CRISPR-Cas9 mediated delta integration. J Biotechnol 310:13-20
50. Guo ZP, Borsenberger V, Croux C et al (2020) An artificial chromosome ylAC enables efficient assembly of multiple genes in Yarrowia lipolytica for biomanufacturing. Commun Biol 3(1):1-10
51. Postma ED, Dashko S, van Breemen L et al (2021) A supernumerary designer chromosome for modular in vivo pathway assembly in Saccharomyces cerevisiae. Nucleic Acids Res 49(3):1769-1783
52. Aznauryan E, Yermanos A, Kinzina E et al (2022) Discovery and validation of human genomic safe harbor sites for gene and cell therapies. Cell Rep Methods 2:100154
53. Haringa C, Jang W, Wang G et al (2018) Computational Fluid Dynamics simulation of an industrial P. chrysogenum fermentation with coupled 9-pool metabolic model: towards rational scale-down and design optimisation. Chem Eng Sci 175:12-24




Synthetic Biology Meets Machine Learning 


Brendan Fu-Long Sieow, Ryan De Sotto, Zhi Ren Darren Seet, 
In Young Hwang, and Matthew Wook Chang 


Abstract 


This chapter outlines the myriad applications of machine learning (ML) in synthetic biology, specifically in 
engineering cell and protein activity, and metabolic pathways. Though by no means comprehensive, the 
chapter highlights several prominent computational tools applied in the field and their potential use cases. 
The examples detailed reinforce how ML algorithms can enhance synthetic biology research by providing 
data-driven insights into the behavior of living systems, even without detailed knowledge of their underly- 
ing mechanisms. By doing so, ML promises to increase the efficiency of research projects by modeling 
hypotheses in silico that can then be tested through experiments. While challenges related to training 
dataset generation and computational costs remain, ongoing improvements in ML tools are paving the way 
for smarter and more streamlined synthetic biology workflows that can be readily employed to address 
grand challenges across manufacturing, medicine, engineering, agriculture, and beyond. 


Key words Machine learning, Synthetic biology, Protein engineering, Metabolic engineering 


1. Introduction 


The era of synthetic biology started long before the term was first 
used by Barbara Hobom in 1980 to describe microbes that were 
genetically modified using recombinant DNA technology
[1]. Fueled by the molecular biology revolution that took place in 
the mid-twentieth century, scientists set their sights on achieving 
the ability to precisely engineer microorganisms—and by the early 
1990s, their dream was finally realized. Since then, advancements in 
sequencing and genomics have paved the way for the evolution of a 
discipline which aims to create, control, and program cellular 
behavior and metabolism [2]. 

At the turn of the millennium, “synthetic biology” referred to
the synthesis of unnatural compounds that function in living
systems. In exploring and taking interchangeable parts from living
systems (biology) and assembling them unnaturally (synthetic) to
create devices that resembled living systems, synthetic biologists
attempted to recreate emergent properties of living systems,
including inheritance and evolution [3, 4]. During the field’s
foundational years, synthetic biologists created simple circuits for
gene regulation and tested them in Escherichia coli, a classic
molecular biology workhorse chosen for its ease of manipulation and
our extensive knowledge of its genetics and genomics [5].

Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology,
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_2,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

[Figure 1: timeline running from the 1960s to the 2020s and the future. Milestones shown: study of the lac operon in E. coli; recombinant DNA technology; the coining of the term “Synthetic Biology”; development of molecular cloning and PCR; automated DNA sequencing and improved computational tools; ‘omics’ technologies for systems biology; synthetic gene circuits; deep learning in functional genomics (DeepBind) and sequence data analysis (DeepNano); programmable RNA switches using deep neural networks (DNN); AlphaFold 2; integration of cells into non-living materials or electronics.]

Fig. 1 A brief timeline of the breakthroughs in the field of machine learning and synthetic biology from the
1960s and beyond

By the mid-2000s, synthetic biology expanded dramatically in 
its endeavors to create and construct biological systems, with a 
long-term goal of engineering whole genomes [6]. But as assem- 
bling interchangeable parts and designing circuits became more 
complex and ambitious, synthetic biologists had to grapple with 
the disproportionate time required to design systems facilitating 
the proper function of the synthesized circuits (Fig. 1) [7]. 

Indeed, while synthetic biology has already delivered novel 
solutions to long-standing challenges in the global healthcare, 
agriculture, manufacturing, and environmental sectors, there is a 
lingering perception that the field has yet to live up to its full 
potential. The rise of artificial intelligence, robotics, and
automation, however, may help synthetic biology overcome this
perception [8, 9]. By reducing the time and cost associated with designing
intricate biological systems, such frontier technologies stand to
improve the field’s return on investment [9].

Over the years, computational methods have become an essen- 
tial part of biology, mirroring the increase of digitalization across 
practically all sectors. The continuous development of sophisticated
algorithms has ushered in new and improved tools in statistics,
simulation, and data management—all of which are reshaping the 
way biological studies are performed [10]. Computational biology, 
with its large datasets streamlined by databases and statistical ana- 
lyses, provides a reference map for the field of biology [11]. 




With its roots in pattern recognition, statistics, and data opti- 
mization, machine learning (ML) is a particular artificial intelli- 
gence method commonly used in computational biology. For 
instance, ML is typically applied by systems biologists to optimize 
fermentation conditions and product biosynthesis routes—reduc- 
ing the need for laborious benchwork and allowing for higher 
chances of experimental success (Wu et al. 2016b). 

One major objective of ML methodologies is to generate pre- 
dictive models based on an underlying algorithm and a given 
dataset containing features and labels across various samples. A 
typical ML workflow starts by inputting data, which is then pro- 
cessed with a set of mathematical formulas and statistical assump- 
tions. This process, called training, pinpoints the optimal 
configuration of model parameters with the aim of translating 
features into an accurate prediction of labels based on the given 
dataset. After identifying the optimized parameters, a new dataset 
can be used to generate output. A model that can accurately predict 
the training data and independent datasets is deemed to have 
properly “learned.” Models that can accurately predict the training 
dataset but not the independent ones are called “overfit” models, 
while those that can neither predict the training dataset nor gener- 
alize to new data are dubbed “underfit” models—both of which are 
major causes for poor performance in ML approaches. Overfitting
and underfitting can be resolved by decreasing or increasing,
respectively, the complexity of the model used for learning
[12, 13].
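
The difference between overfit and underfit models can be made concrete by comparing the error on the training data with the error on held-out data. The sketch below uses k-nearest-neighbor regression on synthetic data. The data, the function names, and the choice of k-nearest neighbors are assumptions made for illustration and are not taken from the cited works. In this model a smaller k corresponds to a more complex, more flexible model.

```python
import math
import random


def knn_predict(train, x, k):
    """Predict y at x as the mean label of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mean_squared_error(train, data, k):
    """Mean squared error, over `data`, of a k-NN model built from `train`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


# Synthetic data: one period of a sine curve plus Gaussian noise.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
points = [(i / 40, math.sin(2 * math.pi * i / 40) + rng.gauss(0.0, 0.3))
          for i in range(40)]
train, test = points[0::2], points[1::2]  # alternate points: training vs held-out

results = {}
for k in (1, 5, len(train)):
    results[k] = (mean_squared_error(train, train, k),
                  mean_squared_error(train, test, k))
    print(f"k={k:2d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With k = 1 the model memorizes the training set, so its training error is exactly zero whatever its held-out error turns out to be, which is the pattern associated with overfitting. With k equal to the size of the training set, every prediction collapses to the average label, the pattern associated with underfitting. An intermediate k would be expected to balance the two.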

Meanwhile, ML methods fall under two overarching cate- 
gories: supervised and unsupervised learning. Unsupervised meth- 
ods like principal component analysis and hierarchical clustering use 
patterns in the features of the input data to produce visualizations 
that help discriminate changes in groups. In contrast, supervised
learning is used when the labels associated with the input
data are already known [14]. For example, if the microbiomes of
healthy and diseased individuals are available, supervised ML can
help accurately predict if a sample from another individual belongs
to the healthy or the diseased group. Both categories can be
implemented with deep learning, a specialized subset of ML that
involves neural networks, algorithms inspired by the human brain.
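
The microbiome example can be sketched as a small supervised classifier. The code below uses a nearest-centroid rule, chosen here only because it is short. The feature values (relative abundances of two unnamed taxa), the labels, and the function names are hypothetical and serve only to show how known labels are used during training.

```python
def centroid(samples):
    """Feature-wise mean of a list of equally long feature vectors."""
    return [sum(column) / len(samples) for column in zip(*samples)]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit(labeled_samples):
    """Training step: store one centroid per known label."""
    return {label: centroid(samples) for label, samples in labeled_samples.items()}


def predict(model, sample):
    """Assign the label whose centroid lies closest to the sample."""
    return min(model, key=lambda label: squared_distance(model[label], sample))


# Hypothetical relative abundances of two taxa in labeled individuals.
training = {
    "healthy": [[0.60, 0.10], [0.55, 0.15], [0.65, 0.05]],
    "diseased": [[0.20, 0.50], [0.25, 0.45], [0.15, 0.55]],
}
model = fit(training)

# A sample from a new individual whose health status is unknown.
print(predict(model, [0.58, 0.12]))
```

An unsupervised method would receive the same feature vectors without the "healthy" and "diseased" labels and could only group the samples by similarity.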

Accordingly, ML strategies can be applied to identify funda- 
mental design principles in synthetic biology—particularly in creat- 
ing components with enhanced novel functions, which, in turn, 
diversify molecular parts that are available for efforts in the field 
[12]. In this introductory chapter, we will discuss the integration of 
ML in synthetic biology, with an emphasis on the subfields of cell 
and metabolic engineering, as well as how such integration can 
enable synthetic biology to meet current challenges in uncovering 
the complexities of biological systems. 




2 Computational Tools for Synthetic Biology Applications 


2.1 Machine Learning for Cell and Protein Engineering


In synthetic biology, the subfield of cell engineering involves the 
assembly of biological components to form gene circuits or net- 
works that can work together with the internal cell machinery to 
restore, improve, or add novel functions to a chosen host cell 
[15]. These biological components often include elements that 
regulate the transcription and translation of proteins, as well as 
transcription factors that can be used to regulate the activity of 
other proteins. 

To design cells that behave in a predictable and reproducible 
way, synthetic biologists have sought to characterize the individual 
performance of known biological components, understand their 
fundamental mechanisms of action, and test the interactions of 
these components within the host cell mostly via trial-and-error 
experimental approaches [16]. While cell engineering techniques 
have become more sophisticated, there are still some hurdles faced 
by synthetic biologists. Given the limited understanding of design 
rules, designing novel biological components and identifying the 
interactions between host cell machinery and engineered compo- 
nents can be a challenge, posing troubleshooting difficulties. To 
this end, ML offers a path for optimally designing and fine-tuning 
biological components with predictable outcomes in the host cell. 
This includes applications in the optimization of gene expression, 
alteration of cellular function, and design of proteins (Fig. 2). 

Tuning gene expression in cells typically involves modifying 
and screening promoter and ribosome binding site (RBS) 
sequences [17] for transcriptional and translational regulation 
through experiments and computational predictive tools. How- 
ever, the latter often requires a comprehensive understanding of 
the regulatory mechanisms controlling gene expression (Choi et al. 
2019). While such tools can be effective, they may not be as useful 
when information is incomplete, especially in the case of 
non-model organisms. 

Having recognized ML’s potential in cell engineering, synthetic
biologists have reduced their reliance on wet lab optimization
experiments, turning instead to screening and designing biological
components in silico. Several research groups have reported using
neural networks, a popular class of ML algorithms, to guide the
data-driven design of promoters [18-20] and RBS sequences [21] for
controlling gene expression. For instance, Meng et al. deployed
neural networks to predict promoter strength with mutated promoters
and RBS sequences as inputs [22]. Notably, their algorithm surpassed
even mechanistic models based on position weight matrix and
thermodynamics methods [23-25].
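
For readers unfamiliar with the position weight matrix (PWM) baseline mentioned above, the sketch below builds a log-odds PWM from a handful of aligned sites and uses it to score candidate sequences. It is a generic illustration, not the model used in the cited studies. The example sequences, the pseudocount of 1, and the uniform background frequency of 0.25 are assumptions.

```python
import math

BASES = "ACGT"


def build_pwm(sequences, pseudocount=1.0, background=0.25):
    """Log2-odds position weight matrix from equally long aligned sequences."""
    total = len(sequences) + pseudocount * len(BASES)
    pwm = []
    for position in range(len(sequences[0])):
        column = [sequence[position] for sequence in sequences]
        scores = {}
        for base in BASES:
            frequency = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(frequency / background)
        pwm.append(scores)
    return pwm


def score(pwm, sequence):
    """Sum of per-position log-odds scores; higher means closer to the motif."""
    return sum(column[base] for column, base in zip(pwm, sequence))


# Hypothetical aligned promoter elements used to build the matrix.
known_sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pwm = build_pwm(known_sites)

for candidate in ("TATAAT", "GGGCCC"):
    print(candidate, round(score(pwm, candidate), 2))
```

A PWM treats each position independently, which is one reason sequence models that can capture interactions between positions, such as neural networks, may outperform it.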


[Figure 2: diagram titled “Machine Learning for Cell Engineering” with three panels. Optimizing gene expression: predicting promoter strength; predicting RBS strength; predicting plasmid expression strength; predicting transcription and translation regulation; predicting RNA-mediated genetic switch performance. Optimizing tools for altering cell function: predict sgRNA on-target efficacy and off-target binding. Optimizing search and design of proteins: search and annotate protein-encoding genes; predict enzyme function; predict enzyme promiscuity; guide directed evolution process; guide rational protein design; predict protein structures.]

Fig. 2 Applications of machine learning for cell engineering. Three categories where machine learning can be
used for: optimizing gene expression, optimizing tools for altering cell function, and optimizing the search and
design of proteins. (Created with Biorender.com)


Aside from predicting the gene expression of biological
components like promoters or RBS from their sequences, other tools
can also look at factors affecting gene expression. One example is 
SelProm, an open-source plasmid selection tool that houses a data- 
base for plasmid expression strength and a prediction tool based on 
partial least-squares regression [26]. SelProm can compare and 
identify inducible promoter expression systems similar to constitu- 
tive expression systems across various conditions including strain, 
media, inducer concentration, induction time point, and plasmid 
backbone. 

Beyond promoters and RBS sequences, ML can also help predict gene
expression by modeling the biological components that take part in
transcription and translation. For example, Tunney et al. used a
feedforward neural network model, in which information is constantly
“fed forward” from one layer to the next in a way that mirrors
biological processes, to predict ribosome distribution along mRNA
transcripts and translational elongation speeds from the coding
sequence of mRNA transcripts [27]. Another study reported the use of
a deep learning technique known as a convolutional neural network
(CNN) to predict protein expression in Saccharomyces cerevisiae from
the 5’ untranslated region (UTR) sequences of mRNAs [28]. The
described model generated more active 5’ UTRs, leading to higher
protein translation and expression rates. Likewise, transcription
regulation can also be predicted with ML. A tool called DeepTFactor
uses a deep neural network (DNN) to predict transcription factors of
both eukaryotic and prokaryotic origins [29]. Promisingly,
DeepTFactor can be used to understand the transcriptional regulatory
systems of an organism of interest for cell engineering applications.
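
To give a feel for what the first layer of a sequence CNN computes,
the sketch below slides a single filter along a one-hot encoded
sequence and keeps the strongest response. It is not the model from
the cited study: the kernel values are hand-picked and hypothetical,
whereas a trained network would learn many such filters from data.

```python
# Minimal sketch of one convolutional filter scanning a sequence.
# The kernel is hypothetical and rewards the motif "ATG".
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of floats in A, C, G, T order."""
    return [[1.0 if b == x else 0.0 for x in BASES] for b in seq.upper()]


def conv1d_scan(seq, kernel):
    """Return one score per window of len(kernel) nucleotides."""
    x = one_hot(seq)
    width = len(kernel)
    scores = []
    for start in range(len(x) - width + 1):
        score = sum(
            kernel[i][j] * x[start + i][j]
            for i in range(width)
            for j in range(4)
        )
        scores.append(score)
    return scores


def relu_max_pool(scores):
    """Apply a ReLU and global max pooling to the window scores."""
    return max(0.0, max(scores))


kernel = [
    [1.0, 0.0, 0.0, 0.0],  # position 1 prefers A
    [0.0, 0.0, 0.0, 1.0],  # position 2 prefers T
    [0.0, 0.0, 1.0, 0.0],  # position 3 prefers G
]
scores = conv1d_scan("CCATGCC", kernel)
print(scores, relu_max_pool(scores))
```

The pooled value is high only when the motif occurs somewhere in the
sequence, which is how such filters act as learned motif detectors.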

Synthetic biologists have also worked towards controlling gene 
expression by developing RNA-mediated genetic switches 
[30]. One such genetic switch is the riboswitch, an aptamer- 
containing mRNA molecule that can recognize and bind to specific 
ligands and, in turn, control gene expression through a conforma- 
tional change [31]. In 2019, Groher et al. reported combining a 
CNN with a classification algorithm called random forest analysis to 
develop a prediction model that accounted for the aptamer 
sequence’s biophysical properties, including entropy, stem melting 
temperature, and GC content. The model was then used to 
improve the dynamic range of a tandem tetracycline-dependent 
riboswitch (Groher et al. 2019). Another kind of genetic switch is 
the toehold switch, which consists of an RNA hairpin placed at the
5’ end of an mRNA molecule, allowing translation to occur when
triggered [31]. While riboswitches control gene expression through
conformational changes, toehold switches are distinguished by their
control of gene expression via base pairing with target RNA
sequences. In 2020, two studies described predicting toehold switch
function through DNNs [32], including the Sequence-based Toehold
Optimization and Redesign Model (STORM) and Nucleic-Acid Speech
(NuSpeak) frameworks [33]. Taken together, these predictive tools
enabled by neural networks will greatly help in developing more
robust and sensitive biological circuit components for molecular
detection, biosensing, and precision diagnostics.
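
Simple sequence-derived features of the kind mentioned above for
aptamer sequences can be computed directly. The sketch below is only
an approximation of such features: Shannon entropy of nucleotide
composition stands in for the entropy feature (the cited study may
define it differently), and the Wallace rule is used as a rough
melting temperature estimate for short stems. All function names are
hypothetical.

```python
# Minimal sketch of sequence features for a short RNA or DNA stem.
import math
from collections import Counter


def gc_content(seq):
    """Fraction of G and C nucleotides in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def shannon_entropy(seq):
    """Shannon entropy (bits) of the nucleotide composition."""
    counts = Counter(seq.upper())
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def wallace_tm(seq):
    """Wallace rule estimate in degrees C: 2*(A+T/U) + 4*(G+C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T") + seq.count("U")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


stem = "GGCAUGCC"  # illustrative input only
print(gc_content(stem), shannon_entropy(stem), wallace_tm(stem))
```

Feature vectors like these could be passed to a classifier such as a
random forest alongside learned sequence representations.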

On top of designing biological components to regulate gene 
expression, there is also a need to design more efficient tools for 
altering cell function. This can be achieved by using genome editing 
tools, such as the CRISPR-Cas system, to remove unwanted genes 
or permanently incorporate foreign biological components into the 
cell genome. While these tools have revolutionized the field of 
synthetic biology, there are still opportunities to improve 
CRISPR-Cas tools in terms of predicting and enhancing sgRNA 
binding to the desired target site as well as minimizing off-target 
binding. 

Earlier studies used the support vector machine model, a type of
supervised ML, to enhance CRISPR-Cas9 activity [34, 35] but were
limited by the small size and low quality of training data. However,
a combination of higher-throughput screening methods and deep
learning has improved the accuracy of newer sgRNA activity
prediction tools. One example is the DeepCpf1 tool, which uses DNNs
trained on large-scale sgRNA activity datasets for AsCpf1 (Cpf1 from
Acidaminococcus sp. BV3L6) to predict on-target knockout efficacy
(indel frequencies) [36].

Unlike previous studies that trained on medium-scale datasets,
Kim et al. developed a high-throughput experimental approach that
generated a large dataset of over 15,000 target sequence
compositions and their corresponding indel frequencies, suitable
for applying deep learning approaches. Besides predicting on-target
knockout efficacies, forecasting off-target sgRNA activity using
regression models and DNNs helps prevent undesirable edits that may
result in genomic instability or the functional disruption of normal
genes [37, 38]. To both maximize on-target efficacy (high
sensitivity) and minimize off-target effects, Chuai et al. developed
the tool DeepCRISPR, which uses both unsupervised deep
representation learning and DNNs. Indeed, DeepCRISPR managed to
surpass classic ML methods and could be generalized to other cell
types (Chuai et al. 2018).
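
Before any learned model scores off-target risk, candidate sites are
usually enumerated by sequence similarity to the guide. The sketch
below shows that preliminary step only, as a plain mismatch count on
one strand. It ignores PAM requirements, the reverse strand, and
position-dependent mismatch tolerance, and the function names are
hypothetical.

```python
# Minimal sketch: list near-matches to a guide sequence.
def count_mismatches(guide, site):
    """Number of positions at which two equal-length sequences differ."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    return sum(1 for g, s in zip(guide.upper(), site.upper()) if g != s)


def find_candidate_off_targets(guide, genome, max_mismatches=3):
    """Return (position, mismatches) for imperfect near-matches."""
    hits = []
    n = len(guide)
    for pos in range(len(genome) - n + 1):
        mismatches = count_mismatches(guide, genome[pos:pos + n])
        # Perfect matches are on-target sites, so they are skipped.
        if 0 < mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


hits = find_candidate_off_targets("ACGT", "ACGTACGA", max_mismatches=1)
print(hits)
```

Each hit would then be passed to a trained predictor to estimate how
likely cleavage at that site actually is.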

ML can also be applied in cell engineering to search and anno- 
tate protein-encoding genes within the genome. This is particularly 
useful for designing metabolic pathways and constructing them in 
production host cells [17]. Traditionally, the hidden Markov model 
is used for this purpose [39, 40]. In this method, genes are first 
identified in the genome via protein-coding signatures like the 
Shine-Dalgarno sequence and then functionally annotated based 
on their sequence homology search against a database of character- 
ized proteins. More recently, deep learning models have been used 
to identify [41, 42] and functionally annotate protein sequences 
[43] in genomes from large high-quality experimental datasets. 
One tool currently being leveraged to pinpoint protein sequences 
is DeepRibo, a DNN-based tool that harnesses high-throughput 
ribosome profiling coverage signals and candidate open reading 
frame sequences to map and identify translated open reading frames 
in prokaryotes. A similar tool, REPARATION, performs the same 
function using a random forest classifier [44]. 
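
Candidate open reading frames, which tools of this kind take as
input, can be enumerated with a simple scan. The sketch below covers
the forward strand only, uses ATG as the sole start codon, and checks
for the consensus Shine-Dalgarno motif AGGAGG upstream. These
simplifications, the window size, and the function names are
assumptions made for illustration.

```python
# Minimal sketch of forward-strand ORF enumeration.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) pairs; end is exclusive and includes the stop."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == START:
                start = pos
            elif start is not None and codon in STOPS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return sorted(orfs)


def has_shine_dalgarno(dna, orf_start, motif="AGGAGG", window=20):
    """Check for the motif within `window` bases upstream of the start."""
    upstream = dna.upper()[max(0, orf_start - window):orf_start]
    return motif in upstream


sequence = "AGGAGGTTTTTATGAAATAG"  # illustrative input only
orfs = find_orfs(sequence)
print(orfs, [has_shine_dalgarno(sequence, s) for s, _ in orfs])
```

A learned classifier would then decide which candidates are actually
translated, for example by using ribosome profiling signals.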

Once new proteins have been discovered, the functional annotation
of their sequences can be performed through DNN-based tools like
DeepEC, which uses a protein sequence to precisely and quickly
predict enzyme commission (EC) numbers [43]. EC numbers classify
enzymes based on the chemical reactions they catalyze and help in
the accurate understanding of enzyme functions. Beyond DeepEC,
alternative EC number prediction tools like CatFam [45], DEEPre
[46], DETECT v2 [47], ECPred [48], EFICAz2.5 [49], and PRIAM [50]
can also be considered. In addition to determining enzyme function,
ML could uncover and forecast enzymes that can catalyze novel
reactions through enzyme promiscuity. For instance, chemoinformatic
techniques, partitioned quantum mechanics, and molecular mechanics
can be used to predict metabolite-protein interactions in silico
[51]. However, these techniques are computationally intensive and
require domain expertise. Accordingly, searching and matching
promiscuous enzymes to a reaction are increasingly being performed
with more computationally efficient techniques, such as the support
vector machine [52] and the Gaussian process model [53]. These
techniques make their predictions based on protein sequences (e.g.,
k-mers), reaction signatures (e.g., functional groups, chemical
transformation properties), and substrate affinity for proteins (Km
values). Equipped with these tools, metabolic engineers now have new
ways to find enzymes for novel biochemical reactions when no known
enzyme is available.
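
As an illustration of the k-mer features mentioned above, the sketch
below turns a protein sequence into overlapping k-mer counts and then
into a fixed-length vector. The choice of k, the vocabulary, and the
function names are assumptions for illustration, not settings from
the cited studies.

```python
# Minimal sketch of k-mer count features for a protein sequence.
from collections import Counter


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_vector(seq, vocabulary, k=3):
    """Return counts in the fixed order given by `vocabulary`."""
    counts = kmer_counts(seq, k)
    return [counts.get(kmer, 0) for kmer in vocabulary]


protein = "MKVMKV"  # illustrative input only
vocabulary = ["MKV", "KVM", "VMK", "AAA"]
print(kmer_vector(protein, vocabulary))
```

Vectors like this can serve as input to a support vector machine or
a Gaussian process model, alongside reaction-level descriptors.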

Another application of ML is the design and engineering of 
proteins. The most common approach is the use of directed evolu- 
tion, where proteins go through iterative experimental rounds of 
mutation and selection until the desired function and performance 
are achieved [54]. ML can guide the directed evolution process by 
reducing the number of experimental iterations to attain the 
desired protein. This involves using previous experimental data 
consisting of each protein’s sequence and its functional perfor- 
mance to generate a library of variants with higher fitness. Wu 
et al. simultaneously deployed multiple ML models and picked 
the models with the highest accuracy to more efficiently evolve 
two proteins: human guanine nucleotide-binding protein (GB1) 
and nitric oxide dioxygenase (NOD) from Rhodothermus marinus 
[55]. ML-assisted directed evolution has also been used to increase 
enzyme productivity [56], change fluorescent protein colors [57], 
and optimize protein thermostability [58]. 
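
The idea of using earlier rounds of data to choose the next variants
can be sketched with a deliberately simple model. The example below
swaps in a per-site additive model rather than the ensemble of ML
models used in the cited work, and the variants and fitness values
are made up for illustration. All names are hypothetical.

```python
# Minimal sketch of model-guided variant selection.
from collections import defaultdict
from itertools import product


def fit_additive_model(measured):
    """Estimate a mean fitness and a per-(site, residue) deviation."""
    overall = sum(measured.values()) / len(measured)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for variant, fitness in measured.items():
        for pos, residue in enumerate(variant):
            sums[(pos, residue)] += fitness - overall
            counts[(pos, residue)] += 1
    effects = {key: sums[key] / counts[key] for key in sums}
    return overall, effects


def predict(variant, overall, effects):
    """Predicted fitness: mean plus the summed per-site effects."""
    return overall + sum(
        effects.get((pos, residue), 0.0)
        for pos, residue in enumerate(variant)
    )


def rank_candidates(measured, residues_per_site, top_n=3):
    """Rank unmeasured variants by predicted fitness, best first."""
    overall, effects = fit_additive_model(measured)
    candidates = ("".join(c) for c in product(*residues_per_site))
    unseen = [v for v in candidates if v not in measured]
    unseen.sort(key=lambda v: predict(v, overall, effects), reverse=True)
    return unseen[:top_n]


measured = {"AA": 1.0, "AV": 2.0, "GA": 3.0}  # made-up fitness values
residues_per_site = [["A", "G"], ["A", "V"]]
print(rank_candidates(measured, residues_per_site))
```

In practice the top-ranked variants would be built and tested, and
their measurements added to the training data for the next round.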

Besides directed evolution, ML can also play a part in rational 
protein design. For instance, UniRep can learn the statistical repre- 
sentations of proteins (e.g., physicochemical properties, structural, 
evolutionary, and functional information) from 24 million Uni- 
Ref50 sequences [59] using neural networks. The tool managed 
to predict the stability of a large series of de novo proteins and 
functional changes from point mutations made in wild-type proteins.
In an exploratory study, George Church’s group at Harvard University
deployed UniRep to optimize the design of green fluorescent protein
from Aequorea victoria and the TEM-1 β-lactamase enzyme from E. coli,
even from a limited pool of training data [60]. Another study used
neural networks trained to
associate amino acids with the neighboring spatial orientation of 
carbon, oxygen, nitrogen, and sulfur atoms within a protein. By 
doing so, the researchers were able to identify novel gain-of-func- 
tion mutations and improve the protein function of three different 
proteins [61, 62]. 


The ability to predict three-dimensional (3D) protein structures
from amino acid sequences is particularly useful when designing new
proteins. Recently, this area of research experienced a breakthrough
with the debut of DeepMind’s AlphaFold at the Critical Assessment of
Protein Structure Prediction (CASP) competition, where it bested
other well-established groups. AlphaFold now has a faster version,
AlphaFold2, which also uses both neural networks and gradient
descent [63]. Other algorithms that perform the same function
include RoseTTAFold, by David Baker’s group at the University of
Washington [64], and an algorithm by Mohammed AlQuraishi, an
independent fellow in Systems Pharmacology at Harvard Medical School
[65].

Predicting the secretability of proteins is particularly interesting
to synthetic biologists, as there are myriad applications that come
from being able to secrete recombinant proteins from cells,
including engineering therapeutic cells to deliver protein drugs
[66, 67] and large-scale industrial protein production [68]. The
SECRiFY tool uses a gradient-boosted decision tree model and CNNs to
predict the secretability of protein fragments by two yeast species,
namely, S. cerevisiae and Pichia pastoris, from a custom fragment
library, surface display, and a deep sequencing readout [69].
Another study described the development of the SignalP 5.0 tool,
which utilizes DNNs to make protein signal peptide predictions from
amino acid sequences [41]. This improved tool can detect signal
peptides across all domains of life and can distinguish them across
Gram-positive bacteria, Gram-negative bacteria, and archaea.

These ML-based tools will allow synthetic biologists to efficiently
design and optimize biological components for cell engineering.
Leveraging these tools will enhance the evolution of novel and more
complex cell designs in areas such as therapeutic cell development,
precision diagnostics, and industrial biotechnology. While ML tools
cannot replace wet lab experiments, as experimental validation is
still needed, the resulting predictions and insights can help
accelerate conventional screening and iterative experimental
procedures as well as guide the design and engineering of biological
components, reducing the time and resources needed to achieve the
desired results.


2.2 Computational Tools for Metabolic Engineering


Broadly speaking, metabolic engineering builds upon the founda- 
tion of cell engineering. Instead of designing and controlling the 
expression of a single gene or the synthesis of a single protein, the 
subfield involves redesigning pathways that alter the metabolism of 
the engineered organism. Specifically, metabolic engineering 
involves modifying the natural chemical reactions of cells to focus 
on manufacturing desired biological compounds. This is often a 
multi-step process involving multiple enzymes. 




While there are numerous enzyme pathways and unique pro- 
ducts a cell can produce, they typically require the use of a small 
group of common metabolites or cofactors [70]. Accordingly, it is 
necessary to consider the broader cellular context when trying to 
optimize yields of a desired metabolite [71]. For instance, a single 
compound could be a product of many biochemical pathways 
[72]. Although high-yield pathways have been engineered 
through rational design [73-75], such efforts work best for simple
pathways while also requiring detailed knowledge of the enzyme 
reactions involved and significant experimental work. These con- 
siderations have led to a growing interest in using computational 
techniques to approach metabolic engineering like an optimiza- 
tion problem. 

To engineer organisms for high-yield production, one must 
first identify a suitable series of chemical steps for converting sub- 
strates to the desired product. While early works focused on 
enhancing preexisting metabolite yields, advancements in genetic 
engineering technology have since enabled the de novo assembly of 
entire metabolic pathways into production hosts [76, 77]. 

One common method for generating pathways is to combine 
known enzymes into novel pathways using databases such as KEGG 
[78] and BRENDA [79]. These methods can be aided by tools that 
identify plausible pathways for selected substrate-product pairs 
[80]. While such methods have strong predictive power, their 
scope is limited by the selective use of manually curated enzyme 
reaction data. A broader alternative is to create pathways using
generalized reaction rules, which consider the full biochemical
reaction space [72]. As in cell engineering, these rules permit
pathway prediction using inferred enzyme promiscuity, enabling
designs that involve novel metabolites. Notably, Hadadi et al.
created the ATLAS database as a repository of all theoretical enzyme
reactions connecting KEGG metabolites [81].
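
Finding a plausible route between a substrate and a product can be
viewed as a search over a graph of reactions. The sketch below uses
breadth-first search on a toy network to return the shortest chain of
enzymes. The reaction list is invented for illustration and does not
come from KEGG, BRENDA, or ATLAS, and real tools also weigh
cofactors, thermodynamics, and host compatibility.

```python
# Minimal sketch of pathway search as breadth-first search.
from collections import deque


def find_pathway(reactions, substrate, product):
    """Return the shortest list of enzymes from substrate to product.

    `reactions` is a list of (enzyme, reactant, product) triples.
    Returns None when no route exists.
    """
    graph = {}
    for enzyme, source, target in reactions:
        graph.setdefault(source, []).append((enzyme, target))

    queue = deque([(substrate, [])])
    visited = {substrate}
    while queue:
        metabolite, path = queue.popleft()
        if metabolite == product:
            return path
        for enzyme, nxt in graph.get(metabolite, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [enzyme]))
    return None


# Hypothetical toy network with two routes from A to C.
toy_reactions = [
    ("E1", "A", "B"),
    ("E2", "B", "C"),
    ("E3", "A", "D"),
    ("E4", "D", "E"),
    ("E5", "E", "C"),
]
print(find_pathway(toy_reactions, "A", "C"))
```

Replacing the fixed reaction list with reactions generated from
generalized rules enlarges the graph considerably, which is one
source of the computational cost of rule-based approaches.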

However, the main drawback of this approach is its computa- 
tional intensity, though this can be reduced by using defined sets of 
rules to limit searches. With the gradual shift in methods towards 
de novo pathway design, the past years have seen the corresponding 
growth of computational tools and frameworks for this particular 
use case [82-84]. Moreover, it is also possible to suggest pathway 
steps using predicted enzyme activity retrieved from genome 
sequence mining [85]. The most widely used of these tools is
antiSMASH, a data-based workflow for identifying biosynthetic gene
clusters and predicting the chemical structures of their metabolites
[86]. Tools based on hidden Markov models like PRISM have also been
used to predict novel antibiotic compounds from genome databases
[87].




After an enzyme pathway is designed, the next step involves 
optimizing the host strain for metabolic engineering metrics like 
titer, rate of production, and yield (TRY). This is usually achieved 
by tuning the gene expression of both the synthetic and endoge- 
nous pathways. One approach is to create mathematical models that 
represent cellular processes to help identify potential production 
limitations. The most widely used of these techniques is known as 
flux balance analysis, which models the flow of chemical com- 
pounds through metabolic networks using stoichiometric rules 
[88]. The ubiquity of flux balance analysis can be attributed to its 
generality and computational affordability [89], with an increasing 
range of software applications being developed for this purpose 
[90]. Meanwhile, tools such as COBRA use computational frame- 
works to predict gene regulation or knockouts that increase the 
production of a target metabolite [91]. However, the reductionist 
nature of such analyses can limit their predictive power and possible 
use cases. 
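
For reference, flux balance analysis is commonly posed as a linear
program. The statement below is a standard textbook form rather than
a formulation taken from this chapter, and the symbols are notation
assumed here: S is the stoichiometric matrix (metabolites by
reactions), v is the vector of reaction fluxes, c holds the weights
that define the objective (for example, biomass or target metabolite
production), and v_min and v_max are flux bounds.

```latex
\begin{aligned}
\max_{v} \quad & c^{\top} v \\
\text{subject to} \quad & S\,v = 0, \\
& v_{\min} \le v \le v_{\max}
\end{aligned}
```

The equality constraint expresses the steady-state assumption that
each internal metabolite is produced and consumed at equal rates,
which is what allows the problem to be solved without kinetic
parameters.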

A more quantitative alternative is to use mechanistic models to 
capture metabolic networks. For instance, chemical species can be 
described by using enzyme kinetics as a series of rate laws, which, in 
turn, can be analyzed as a series of ordinary differential equations. 
In doing so, these mechanistic models facilitate sensitivity analyses 
that help determine potential metabolic engineering targets for 
increasing TRY. As opposed to flux balance analysis (FBA), mecha- 
nistic modeling provides insights into isozyme activity and how 
metabolite pools may be affected by gene manipulation [92]. However,
the technique’s main drawback is its reliance on having characterized
kinetic data for accurate predictions. While validated
models are common for species like E. coli [93-95], characterizing 
large numbers of enzymes is a laborious and time-consuming effort 
that can be unfeasible for non-model organisms. 
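
A single-reaction example shows how a rate law becomes an ordinary
differential equation. The sketch below assumes Michaelis-Menten
kinetics for one enzyme converting substrate S to product P and
integrates dS/dt = -v, dP/dt = +v with a simple forward-Euler step.
The parameter values are arbitrary, and a real kinetic model would
couple many such equations and use a proper ODE solver.

```python
# Minimal sketch of a one-reaction kinetic model.
def michaelis_menten_rate(s, vmax, km):
    """Reaction rate v = Vmax * S / (Km + S)."""
    return vmax * s / (km + s)


def simulate(s0, vmax, km, dt=0.01, steps=1000):
    """Forward-Euler integration; returns (time, S, P) tuples."""
    s, p = s0, 0.0
    trajectory = [(0.0, s, p)]
    for step in range(1, steps + 1):
        v = michaelis_menten_rate(s, vmax, km)
        converted = min(v * dt, s)  # never consume more than remains
        s -= converted
        p += converted
        trajectory.append((step * dt, s, p))
    return trajectory


# Arbitrary illustrative parameters.
trajectory = simulate(s0=2.0, vmax=1.0, km=0.5)
final_time, final_s, final_p = trajectory[-1]
print(final_time, final_s, final_p)
```

Rerunning the simulation with perturbed Vmax or Km values is a crude
form of the sensitivity analysis described above.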

In contrast, the other major approach for strain optimization is
data-driven ML analysis. Within the context of metabolic
engineering, supervised ML algorithms can factor in multi-omics data
and culture conditions to predict metabolite production. Since they
do not require prior knowledge of the underlying biochemistry, ML
approaches are widely applicable for combinatorial pathway assembly
[96] and improving TRY (Czajka et al. 2021). At present, one major
challenge for ML in metabolic engineering lies in generating the
large biological datasets required for training algorithms. To
address this limitation, Radivojevic et al. developed ART, an ML
tool combining network optimization and experimental design [97]. By
recommending experimental designs to achieve the specified
objective, the team demonstrated predictive modeling with as few as
19 built strains in a test cycle.




Although the two approaches have fundamentally different bases,
there is a developing interest in integrating mechanistic models
with ML. In principle, this leverages the advantages of both
approaches to provide data-driven predictions and insights into the
underlying biology. For example, imposing model constraints based on
biological context has been shown to increase prediction accuracy 
by disregarding biologically impossible solution spaces [98]. One 
direction being explored is the use of data obtained from mecha- 
nistic models as an input for ML. In silico flux predictions, for 
example, have been shown to increase the predictive power of ML 
in whole-genome models of yeast and cyanobacteria 
[99, 100]. Likewise, genome-scale models can be used to identify 
engineering targets and focus the scope of ML algorithms 
[101]. An alternate approach is to use ML to predict the parameters 
used in mechanistic models. As shown by Heckmann et al., enzyme 
turnover rates determined by ML models outperform naively 
assigned values at flux predictions [102]. 

Beyond engineering changes in the host organism, computa- 
tional methods can also improve TRY by optimizing bioreactor 
conditions. After all, microbial growth is influenced by environ- 
mental conditions including substrate availability, oxygen content, 
and pH—with any changes to these reactor parameters thereby 
affecting metabolite production. However, experimentally opti- 
mizing bioreactor cultures can be lengthy and costly. To this end, 
computational strategies to optimize bioreactor performance have 
been developed, including more contemporary neural network 
models that can suggest conditions for improving metabolite pro- 
duction [103]. Moreover, biochemical production is often carried 
out in high-volume bioreactors, which can have drastically different 
environments from small-scale lab cultures [104]. Design tools 
have since been adapted to accommodate some common limita- 
tions. For example, metabolic burden can be integrated in genome- 
scale models to account for the energy cost of expressing synthetic 
pathways [105]. Meanwhile, heterogeneous population modeling 
can be used to account for cell variability and stochasticity in 
bioreactors [106]. 

Ultimately, one of the goals of metabolic engineering is to 
integrate pathway design as well as the optimization of host strain 
and culture conditions into a combined pipeline (Fig. 3). Having a 
standard workflow increases reproducibility, helps reduce the time 
needed from project conceptualization to realization [107], and 
enables the use of experimental automation to increase throughput. 
Still, despite the advantages of having a comprehensive pipeline for 
metabolic engineering, there is a lack of published research describ- 
ing such approaches. This creates an opportunity for industries to 
develop proprietary workflows for engineering organisms for 
industrial applications and the academic sector to explore ways to 
incorporate ML algorithms in streamlining the engineering of 
biological pathways in organisms. 


[Figure 3 panels. (a) Synthesis pathway identification, biosynthetic
gene prediction, host strain selection. (b) Metabolic engineering:
mechanistic approaches (flux balance analysis, cell modeling) and
data-driven approaches (multi-omics analysis, algorithmic
prediction); gene expression tuning. (c) Bioprocess optimization:
growth condition optimization, production scaling, maximizing
recovery and purification.]

Fig. 3 Applications of machine learning in metabolic engineering workflows. A metabolic engineering project 
can be broadly split into three phases: designing metabolic pathways, optimizing cells for production, and 
optimizing industrial processes for product yield. Many computational tools have been developed to guide 
design throughout the process. (a) Pathways for synthesizing target products can be designed using known 
biochemical reactions or predicted gene functions. This can help identify hosts with natural industrially 
relevant properties. (b) Strains are engineered to maximize production titer, rate, and yield. Mechanistic 
approaches use knowledge of the underlying biology to predict metabolite production. In contrast, data-driven 
methods identify patterns from large datasets to suggest improvements. Recent efforts have attempted to 
combine the two approaches to increase predictive power. (c) Downstream bioprocesses are optimized for 
high product output. In silico prediction can greatly reduce the time required to adapt a laboratory strain for 
industrial-scale production. (Created with Biorender.com) 


3 Conclusions and Future Outlooks 


Through the examples described in the chapter, it is apparent that 
ML is already shaping synthetic biology across several subfields. 
With the advancements in the development of computational 
tools for cell, protein, and metabolic engineering, ML is increas- 
ingly being integrated into workflows as a tool for providing richer, 
data-driven insights that can then inform experimental approaches.
Crucially, the use of ML techniques allows the generation of
predictive models even with limited knowledge and data on underlying
mechanisms. By first modeling hypotheses in silico that can then be
selectively tested through experiments, ML increases the speed and
precision of biological research projects, all while reducing the
time and resources needed. As a result, larger libraries or
combinatorial approaches that may otherwise be too troublesome to
study can be screened.

Despite these initial advantages, there are several challenges that
must be addressed to unleash the full potential of ML. Arguably, the
largest issue lies in the difficulty of generating the large
datasets required for training models. While ML approaches can
facilitate biological design, much of the building and testing is
done by hand and is hard to scale up. Additionally, the lack of a
fixed standard in data collection and reporting across labs and
institutions contributes to the difficulty in directly comparing
data from multiple studies. While these challenges can be addressed
by experimental automation (e.g., liquid handling systems) to a
certain extent, such technologies remain costly. At present,
training ML models may still incur high computational costs, but the
continuous improvement of computing power through the years may make
the costs more manageable.

Efforts among the community to make published models open-source are
already making ML approaches more accessible to researchers. With
the rise of computational and ML tools, we predict that they will
play an integral role in future synthetic biology workflows. Through
these tools, we envisage the possibility of precise yet broadly
applicable models that can facilitate de novo bioengineering with
unparalleled efficiency. ML coupled with automation will allow the
faster completion of research, from design to execution, with
minimal human intervention. This would enable scientists and
engineers to take on a more strategic role, shifting their time and
focus to higher-value activities like planning and innovation. Such
innovations are increasingly necessary in the coming years, with the
global need for sustainable practices in manufacturing, medicine,
engineering, and agriculture. Computational tools have
revolutionized synthetic biology over the past two decades, and we
look forward to seeing where they can lead us in the future.


References


1. Hobom B (1980) Gene surgery: on the threshold of synthetic
   biology. Med Klin 75(24):834-841
2. Cameron DE, Bashor CJ, Collins JJ (2014) A brief history of
   synthetic biology. Nat Rev Microbiol 12(5):381-390
3. Benner SA, Sismour AM (2005) Synthetic biology. Nat Rev Genet
   6(7):533-543
4. Szostak JW, Bartel DP, Luisi PL (2001) Synthesizing life. Nature
   409(6818):387-390
5. McAdams HH, Arkin A (2000) Gene regulation: towards a circuit
   engineering discipline. Curr Biol 10(8):R318-R320
6. Ball P (2004) Synthetic biology: starting from scratch. Nature
   431(7009):624-627
7. Kwok R (2010) Five hard truths for synthetic biology. Nature News
   463(7279):288-290
8. El Karoui M, Hoyos-Flight M, Fletcher L (2019) Future trends in
   synthetic biology - a report. Front Bioeng Biotechnol 7:175
9. Zhang R, Li C, Wang J, Yang Y, Yan Y (2018) Microbial production
   of small medicinal molecules and biologics: from nature to
   synthetic pathways. Biotechnol Adv 36(8):2219-2231
10. Stevens H (2013) Life out of sequence. University of Chicago
    Press, Chicago/London
11. Markowetz F (2017) All biology is computational biology. PLoS
    Biol 15(3):e2002050
12. Camacho DM, Collins KM, Powers RK, Costello JC, Collins JJ
    (2018) Next-generation machine learning for biological networks.
    Cell 173(7):1581-1592
13. Domingos P (2012) A few useful things to know about machine
    learning. Commun ACM 55(10):78-87
14. James G, Witten D, Hastie T, Tibshirani R (2013) An introduction
    to statistical learning. Springer New York, NY
15. Xie M, Haellman V, Fussenegger M (2016) Synthetic biology -
    application-oriented cell engineering. Curr Opin Biotechnol
    40:139-148. https://doi.org/10.1016/j.copbio.2016.04.005
16. Healy CP, Deans TL (2019) Genetic circuits to engineer tissues
    with alternative functions. J Biol Eng 13(1):39.
    https://doi.org/10.1186/s13036-019-0170-7
17. Lawson CE, Marti JM, Radivojevic T, Jonnalagadda SVR, Gentz R,
    Hillson NJ et al (2021) Machine learning for metabolic
    engineering: a review. Metab Eng 63:34-60.
    https://doi.org/10.1016/j.ymben.2020.10.005
18. Kotopka BJ, Smolke CD (2020) Model-driven generation of
    artificial yeast promoters. Nat Commun 11(1):2113.
    https://doi.org/10.1038/s41467-020-15977-4
19. Van Brempt M, Clauwaert J, Mey F, Stock M, Maertens J,
    Waegeman W et al (2020) Predictive design of sigma
    factor-specific promoters. Nat Commun 11(1):5822.
    https://doi.org/10.1038/s41467-020-19446-w
20. Zhao M, Yuan Z, Wu L, Zhou S, Deng Y (2021) Precise prediction
    of promoter strength based on a de novo synthetic promoter
    library coupled with machine learning. ACS Synth Biol.
    https://doi.org/10.1021/acssynbio.1c00117
21. Jervis AJ, Carbonell P, Vinaixa M, Dunstan MS, Hollywood KA,
    Robinson CJ et al (2019) Machine learning of designed
    translational control allows predictive pathway optimization in
    Escherichia coli. ACS Synth Biol 8(1):127-136.
    https://doi.org/10.1021/acssynbio.8b00398
22. Meng H, Wang J, Xiong Z, Xu F, Zhao G, Wang Y (2013)
    Quantitative design of regulatory elements based on
    high-precision strength prediction using artificial neural
    network. PLoS One 8(4):e60288.
    https://doi.org/10.1371/journal.pone.0060288
23. Salis HM, Mirsky EA, Voigt CA (2009) Automated design of
    synthetic ribosome binding sites to control protein expression.
    Nat Biotechnol 27(10):946-950.
    https://doi.org/10.1038/nbt.1568
24. Leveau JHJ, Lindow SE (2001) Predictive and interpretive
    simulation of green fluorescent protein expression in reporter
    bacteria. J Bacteriol 183(23):6752-6762.
    https://doi.org/10.1128/JB.183.23.6752-6762.2001
25. Rhodius VA, Mutalik VK (2010) Predicting strength and function
    for promoters of the Escherichia coli alternative sigma factor,
    σE. Proc Natl Acad Sci 107(7):2854-2859.
    https://doi.org/10.1073/pnas.0915066107
26. Jervis AJ, Carbonell P, Taylor S, Sung R, Dunstan MS, Robinson
    CJ et al (2019) SelProm: a queryable and predictive expression
    vector selection tool for Escherichia coli. ACS Synth Biol
    8(7):1478-1483. https://doi.org/10.1021/acssynbio.8b00399
27. Tunney R, McGlincy NJ, Graham ME, Naddaf N, Pachter L, Lareau LF
    (2018) Accurate design of translational output by a neural
    network model of ribosome distribution. Nat Struct Mol Biol
    25(7):577-582. https://doi.org/10.1038/s41594-018-0080-2
28. Cuperus JT, Groves B, Kuchina A, Rosenberg AB, Jojic N, Fields S
    et al (2017) Deep learning of the regulatory grammar of yeast
    5’ untranslated regions from 500,000 random sequences. Genome
    Res 27(12):2015-2024. https://doi.org/10.1101/gr.224964.117
29. Kim GB, Gao Y, Palsson BO, Lee SY (2021) DeepTFactor: a deep
    learning-based tool for the prediction of transcription factors.
    Proc Natl Acad Sci 118(2):e2021171118.
    https://doi.org/10.1073/pnas.2021171118
30. Karagiannis P, Fujita Y, Saito H (2016) RNA-based gene circuits
    for cell regulation. Proc Jpn Acad Ser B Phys Biol Sci
    92(9):412-422. https://doi.org/10.2183/pjab.92.412
31. Chau THT, Mai DHA, Pham DN, Le HTQ, Lee EY (2020) Developments
    of riboswitches and toehold switches for molecular
    detection-biosensing and molecular diagnostics. Int J Mol Sci
    21(9):3192. https://doi.org/10.3390/ijms21093192
32. Angenent-Mari NM, Garruss AS, Soenksen LR, Church G, Collins JJ
    (2020) A deep learning approach to programmable RNA switches.
    Nat Commun 11(1):5057.
    https://doi.org/10.1038/s41467-020-18677-1
33. Valeri JA, Collins KM, Ramesh P, Alcantar MA, Lepe BA, Lu TK
    et al (2020) Sequence-to-function deep learning frameworks for
    engineered riboregulators. Nat Commun 11(1):5058.
    https://doi.org/10.1038/s41467-020-18676-2
34. Chari R, Yeo NC, Chavez A, Church GM (2017) sgRNA scorer 2.0: a
    species-independent model to predict CRISPR/Cas9 activity. ACS
    Synth Biol 6(5):902-904.
    https://doi.org/10.1021/acssynbio.6b00343
35. Doench JG, Fusi N, Sullender M, Hegde M, Vaimberg EW, Donovan KF
    et al (2016) Optimized sgRNA design to maximize activity and
    minimize off-target effects of CRISPR-Cas9. Nat Biotechnol
    34(2):184-191. https://doi.org/10.1038/nbt.3437
36. Kim HK, Min S, Song M, Jung S, Choi JW, Kim Y et al (2018) Deep
    learning improves prediction of CRISPR-Cpf1 guide RNA activity.
    Nat Biotechnol 36(3):239-241.
    https://doi.org/10.1038/nbt.4061
37. Listgarten J, Weinstein M, Kleinstiver BP, Sousa AA, Joung JK,
    Crawford J et al (2018) Prediction of off-target activities for
    the end-to-end design of CRISPR guide RNAs. Nat Biomed Eng
    2(1):38-47. https://doi.org/10.1038/s41551-017-0178-6
38. Lin J, Wong K-C (2018) Off-target predictions in CRISPR-Cas9
    gene editing using deep learning. Bioinformatics
    34(17):i656-i663. https://doi.org/10.1093/bioinformatics/bty554
39. Finn RD, Clements J, Eddy SR (2011) HMMER web server:
    interactive sequence similarity searching. Nucleic Acids Res
    39(suppl_2):W29-W37. https://doi.org/10.1093/nar/gkr367

Yoon B-J (2009) Hidden Markov models and 
their applications in biological sequence anal- 
ysis. Curr Genomics 10(6):402-415. https: // 
doi.org/10.2174/138920209789177575 
Almagro Armenteros JJ, Tsirigos KD, Son- 
derby CK, Petersen TN, Winther O, Brunak 
S et al (2019) SignalP 5.0 improves signal 
peptide predictions using deep neural 


42. 


43. 


44, 


45. 


46. 


47. 


48. 


49, 


50. 


51. 


networks. Nat Biotechnol 37(4):420-423. 
https: //doi.org/10.1038/s41587-019- 
0036-z 


Clauwaert J, Menschaert G, Waegeman W 
(2019) DeepRibo: a neural network for pre- 
cise gene annotation of prokaryotes by com- 
bining ribosome profiling signal and binding 
site patterns. Nucleic Acids Res 47(6):e36-e. 
https: //doi.org/10.1093 /nar/gkz061 


Ryu JY, Kim HU, Lee SY (2019) Deep 
learning enables high-quality and high- 
throughput prediction of enzyme commission 
numbers. Proc Natl Acad Sci 116(28):13996. 
https://doi.org/10.1073/pnas. 
1821905116 


Ndah E, Jonckheere V, Giess A, Valen E, 
Menschaert G, Van Damme P (2017) REPA- 
RATION: ribosome profiling assisted (re-) 
annotation of bacterial genomes. Nucleic 
Acids Res 45(20):e168-e. https://doi.org/ 
10.1093 /nar/gkx758 

Yu C, Zavaljevski N, Desai V, Reifman J 
(2009) Genome-wide enzyme annotation 
with precision control: catalytic families (Cat- 
Fam) databases. Proteins 74(2):449-460. 
https: //doi.org/10.1002/prot.22167 

Li Y, Wang S, Umarov R, Xie B, Fan M, Li L 
et al (2018) DEEPre: sequence-based enzyme 
EC number prediction by deep learning. Bio- 
informatics 34(5):760-769. https: //doi.org/ 
10.1093 /bioinformatics /btx680 


Nursimulu N, Xu LL, Wasmuth JD, Krukov I, 
Parkinson J (2018) Improved enzyme anno- 
tation with EC-specific cutoffs using 
DETECT v2. Bioinformatics 34(19): 
3393-3395. https: //doi.org/10.1093 /bioin 
formatics/bty368 


Dalkiran A, Rifaioglu AS, Martin MJ, Cetin- 
Atalay R, Atalay V, Dogan T (2018) ECPred: 
a tool for the prediction of the enzymatic 
functions of protein sequences based on the 
EC nomenclature. BMC Bioinformatics 
19(1):334. https://doi.org/10.1186/ 
s12859-018-2368-y 

Kumar N, Skolnick J (2012) EFICAz2.5: 
application of a high-precision enzyme func- 
tion predictor to 396 proteomes. Bioinfor- 
matics 28(20):2687-2688. https: //doi.org/ 
10.1093 /bioinformatics/bts5 10 
Claudel-Renard C, Chevalet C, Faraut T, 
Kahn D (2003) Enzyme-specific profiles for 
genome annotation: PRIAM. Nucleic Acids 
Res 31(22):6633-6639. https://doi.org/10. 
1093 /nar/gkg847 

Alderson RG, De Ferrari L, Mavridis L, 
McDonagh JL, Mitchell JBO, Nath N 
(2012) Enzyme informatics. Curr Top Med 


52. 


53. 


54. 


55. 


56. 


57. 


58. 


59. 


60. 


ol. 


Chem 12(17):1911-1923. https://doi.org/ 
10.2174/156802612804547353 


Faulon J-L, Misra M, Martin S, Sale K, Sapra 
R (2008) Genome scale enzyme—metabolite 
and drug-—target interaction predictions using 
the signature molecular descriptor. Bioinfor- 
matics 24(2):225-233. https://doi.org/10. 
1093 /bioinformatics/btm580 

Mellor J, Grigoras I, Carbonell P, Faulon J-L 
(2016) Semisupervised Gaussian process for 
automated enzyme search. ACS Synth Biol 
5(6):518-528. https://doi.org/10.1021/ 
acssynbio.5b00294 


Yang KK, Wu Z, Arnold FH (2019) Machine- 
learning-guided directed evolution for pro- 
tein engineering. Nat Methods 16(8): 
687-694. https://doi.org/10.1038/ 
s41592-019-0496-6 

Wu Z, Kan SBJ, Lewis RD, Wittmann BJ, 
Arnold FH (2019) Machine learning-assisted 
directed protein evolution with combinatorial 
libraries. Proc Natl Acad Sci 116(18):8852. 
https://doi.org/10.1073/pnas. 
1901979116 


Fox RJ, Davis SC, Mundorff EC, Newman 
LM, Gavrilovic V, Ma SK et al (2007) Improv- 
ing catalytic function by ProSAR-driven 
enzyme evolution. Nat Biotechnol 25(3): 
338-344. https://doi.org/10.1038/ 
nbt1286 

Saito Y, Oikawa M, Nakazawa H, Niide T, 
Kameda T, Tsuda K et al (2018) Machine- 
learning-guided mutagenesis for directed evo- 
lution of fluorescent proteins. ACS Synth Biol 
7(9):2014—2022. https://doi.org/10.1021/ 
acssynbio.8b00155 

Romero PA, Krause A, Arnold FH (2013) 
Navigating the protein fitness landscape with 
Gaussian processes. Proc Natl Acad Sci 
110(3):E193. https://doi.org/10.1073/ 
pnas.1215251110 

Suzek BE, Wang Y, Huang H, McGarvey PB, 
Wu CH, the UniProt C (2015) UniRef clus- 
ters: a comprehensive and scalable alternative 
for improving sequence similarity searches. 
Bioinformatics 31(6):926-932. https://doi. 
org/10.1093 /bioinformatics/btu7 39 

Biswas S, Khimulya G, Alley EC, Esvelt KM, 
Church GM (2021) Low-N protein engineer- 
ing with data-efficient deep learning. Nat 
Methods 18(4):389-396. https://doi.org/ 
10.1038 /s41592-021-01100-y 

Shroff R, Cole AW, Diaz DJ, Morrow BR, 
Donnell I, Annapareddy A et al (2020) Dis- 
covery of novel gain-of-function mutations 
guided by structure-based deep learning. 


62. 


63. 


64. 


65. 


66. 


67. 


68. 


69. 


70. 


71. 


Synbio x ML 37 


ACS Synth Biol 9(11):2927-2935. https:// 
doi.org/10.1021/acssynbio.0c00345 

Torng W, Altman RB (2017) 3D deep con- 
volutional neural networks for amino acid 
environment similarity analysis. BMC Bioin- 
formatics 18(1):302. https://doi.org/10. 
1186/s12859-017-1702-0 

Jumper J, Evans R, Pritzel A, Green T, 
Figurnov M, Ronneberger O et al (2021) 
Highly accurate protein structure prediction 
with AlphaFold. Nature 596(7873): 
583-589. https://doi.org/10.1038/ 
s41586-021-03819-2 

Baek M, DiMaio F, Anishchenko I, 
Dauparas J, Ovchinnikov S, Lee Gyu R et al 
(2021) Accurate prediction of protein struc- 
tures and interactions using a three-track neu- 
ral network. Science 373(6557):871-876. 
https: //doi.org/10.1126/science.abj8754 
AlQuraishi M (2019) End-to-end differentia- 
ble learning of protein structure. Cell Syst 
8(4):292-301.e3. https://doi.org/10. 
1016/j.cels.2019.03.006 

Sieow BF-L, Wun KS, Yong WP, Hwang IY, 
Chang MW (2021) Tweak to treat: repro- 
graming bacteria for cancer treatment. Trends 
Cancer 7(5):447-464. https://doi.org/10. 
1016/j.trecan.2020.11.004 

Chua KJ, Kwok WC, Aggarwal N, Sun T, 
Chang MW (2017) Designer probiotics for 
the prevention and treatment of human dis- 
eases. Curr Opin Chem Biol 40:8-16. 
https: //doi.org/10.1016/j.cbpa.2017. 
04.011 

Huang M, Wang G, Qin J, Petranovic D, 
Nielsen J (2018) Engineering the protein 
secretory pathway of Saccharomyces cerevisiae 
enables improved protein production. Proc 
Natl Acad Sci U S A_ 115(47): 
E11025-Elle32. https://doi.org/10. 
1073/pnas.1809921115 

Boone M, Ramasamy P, Zuallaert J, 
Bouwmeester R, Van Moer B, Maddelein D 
et al (2021) Massively parallel interrogation of 
protein fragment secretability using SECRiFY 
reveals features influencing secretory system 
transit. Nat Commun 12(1):6414. https:// 
doi.org/10.1038/s41467-021-26720-y 
Csete M, Doyle J (2004) Bow ties, metabo- 
lism and disease. Trends Biotechnol 22(9): 
446-450. https://doi.org/10.1016/j. 
tibtech.2004.07.007 

Stephanopoulos G, Vallino JJ (1991) Net- 
work rigidity and metabolic engineering in 
metabolite overproduction. Science 
252(5013):1675-1681. https://doi.org/10. 
1126/science.1904627 


38 


72 


73. 


74, 


75. 


76. 


77. 


78. 


79. 


80. 


81. 


Brendan Fu-Long Sieow et al. 


. Hatzimanikatis V, Li C, Ionita JA, Henry CS, 
Jankowski MD, Broadbelt LJ (2005) Explor- 
ing the diversity of complex metabolic net- 
works. Bioinformatics 21(8):1603-1609. 
https: //doi.org/10.1093 /bioinformatics/ 
bti213 

Patnaik R, Liao JC (1994) Engineering of 
Escherichia coli central metabolism for aro- 
matic metabolite production with near theo- 
retical yield. Appl Environ Microbiol 60(11): 
3903-3908 

Koffas MAG, Jung GY, Stephanopoulos G 
(2003) Engineering metabolism and product 
formation in Corynebacterium glutamicum 
by coordinated gene overexpression. Metab 
Eng 5(1):32-41. https://doi.org/10.1016/ 
$1096-7176(03)00002-8 

Nakamura CE, Whited GM (2003) Metabolic 
engineering for the microbial production of 
1,3-propanediol. Curr Opin Biotechnol 
14(5):454-459. https: //doi.org/10.1016/j. 
copbio.2003.08.005 


Ro D-K, Paradise EM, Ouellet M, Fisher KJ, 
Newman KL, Ndungu JM et al (2006) Pro- 
duction of the antimalarial drug precursor 
artemisinic acid in engineered yeast. Nature 
440(7086):940-943. https://doi.org/10. 
1038 /nature04640 

Hansen EH, Moller BL, Kock GR, Biinner 
CM, Kristensen C, Jensen OR et al (2009) 
De novo biosynthesis of vanillin in fission 
yeast (Schizosaccharomyces pombe) and 
Baker’s yeast (Saccharomyces cerevisiae). 
Appl Environ Microbiol 75(9):2765-2774. 
https: //doi.org/10.1128 /AEM.02681-08 


Kanehisa M, Goto S (2000) KEGG: Kyoto 
encyclopedia of genes and genomes. Nucleic 
Acids Res 28(1):27-30. https://doi.org/10. 
1093 /nar/28.1.27 

Schomburg I, Chang A, Schomburg D 
(2002) BRENDA, enzyme data and meta- 
bolic information. Nucleic Acids Res 30(1): 
47-49. https://doi.org/10.1093/nar/30. 
1.47 

Moriya Y, Shigemizu D, Hattori M, 
Tokimatsu T, Kotera M, Goto S et al (2010) 
PathPred: an enzyme-catalyzed metabolic 
pathway prediction server. Nucleic Acids Res 
38(suppl_2):W138-WW43. https://doi.org/ 
10.1093 /nar/gkq318 

Hadadi N, Hafner J, Shajkofci A, Zisaki A, 
Hatzimanikatis V (2016) ATLAS of biochem- 
istry: a repository of all possible biochemical 
reactions for synthetic biology and metabolic 
engineering studies. ACS Synth Biol 5(10): 
1155-1166. https://doi.org/10.1021/ 
acssynbio.6b00054 


82. 


83. 


84. 


85. 


86. 


87. 


88. 


89. 


90. 


91. 


92. 


Delépine B, Duigou T, Carbonell P, Faulon 
J-L (2018) RetroPath2.0: a retrosynthesis 
workflow for metabolic engineers. Metab 
Eng 45:158-170 


Kumar A, Wang L, Ng CY, Maranas CD 
(2018) Pathway design using de novo steps 
through uncharted biochemical spaces. Nat 
Commun 9(1):184. https://doi.org/10. 
1038 /s41467-017-02362-x 


Wang L, Dash S, Ng CY, Maranas CD (2017) 
A review of computational tools for design 
and reconstruction of metabolic pathways. 
Synth Syst Biotechnol 2(4):243-252 


Sieow BFL, Nurminen TJ, Ling H, Chang 
MW (2019) Meta-omics-and metabolic 
modeling-assisted deciphering of human 
microbiota metabolism. Biotechnol J 14(9): 
1800445 


Blin K, Shaw S, Steinke K, Villebro R, 
Ziemert N, Lee SY et al (2019) antiSMASH 
5.0: updates to the secondary metabolite 
genome mining pipeline. Nucleic Acids Res 
47(W1):W81-WW7. https://doi.org/10. 
1093 /nar/gkz310 

Skinnider MA, Johnston CW, 
Gunabalasingam M, Merwin NJ, Kieliszek 
AM, MacLellan RJ et al (2020) Comprehen- 
sive prediction of secondary metabolite struc- 
ture and biological activity from microbial 
genome sequences. Nat Commun 11(1): 
6058. https://doi.org/10.1038 /s41467- 
020-19986-1 

Orth JD, Thiele I, Palsson B® (2010) What is 
flux balance analysis? Nat Biotechnol 28(3): 
245-248. https://doi.org/10.1038/nbt. 
1614 


Sahu A, Blatke M-A, Szymanski JJ, Topfer N 
(2021) Advances in flux balance analysis by 
integrating machine learning and 
mechanism-based models. Comput Struct 
Biotechnol J 19:4626-4640. https://doi. 
org/10.1016/j.csbj.2021.08.004 
Lakshmanan M, Koh G, Chung BKS, Lee 
D-Y (2012) Software applications for flux bal- 
ance analysis. Brief Bioinform 15(1): 
108-122. https://doi.org/10.1093/bib/ 
bbs069 

Heirendt L, Arreckx S, Pfau T, Mendoza SN, 
Richelle A, Heinken A et al (2019) Creation 
and analysis of biochemical constraint-based 
models using the COBRA Toolbox v.3.0. Nat 
Protoc 14(3):639-702. https://doi.org/10. 
1038 /s41596-018-0098-2 

Foster CJ, Wang L, Dinh HV, Suthers PF, 
Maranas CD (2021) Building kinetic models 
for metabolic engineering. Curr Opin 


93. 


94. 


95. 


96. 


97. 


98. 


99. 


100. 


Biotechnol 67:35-41. https://doi.org/10. 
1016/j.copbio.2020.11.010 


Khodayari A, Zomorrodi AR, Liao JC, Mar- 
anas CD (2014) A kinetic model of Escher- 
ichia coli core metabolism satisfying multiple 
sets of mutant flux data. Metab Eng 25: 
50-62. https://doi.org/10.1016/j.ymben. 
2014.05.014 


Khodayari A, Chowdhury A, Maranas CD 
(2015) Succinate overproduction: a case 
study of computational strain design using a 
comprehensive Escherichia coli kinetic model. 
Front Bioeng Biotechnol 2(76). https://doi. 
org/10.3389 /fbioe.2014.00076 


Khodayari A, Maranas CD (2016) A genome- 
scale Escherichia coli kinetic metabolic model 
k-ecoli457 satisfying flux data for multiple 
mutant strains. Nat Commun 7(1):13806. 
https: //doi.org/10.1038 /ncomms13806 
Zhou Y, Li G, Dong J, Xing X-h, Dai J, Zhang 
C (2018) MiYA, an efficient machine-learning 
workflow in conjunction with the YeastFab 
assembly strategy for combinatorial optimiza- 
tion of heterologous metabolic pathways in 
Saccharomyces cerevisiae. Metab Eng 47: 
294-302 

Radivojevic T, Costello Z, Workman K, Mar- 
tin HG (2020) A machine learning Auto- 
mated Recommendation Tool for synthetic 
biology. Nat Commun 11(1):1-14 


Wu SG, Wang Y, Jiang W, Oyetunde T, Yao R, 
Zhang X et al (2016) Rapid prediction of 
bacterial heterotrophic fluxomics using 
machine learning and constraint program- 
ming. PLoS Comput Biol 12(4):c1004838. 
https: //doi.org/10.1371/journal.pcbi. 
1004838 

Culley C, Vijayakumar S, Zampieri G, 
Angione C (2020) A mechanism-aware and 
multiomic machine-learning pipeline charac- 
terizes yeast cell growth. Proc Natl Acad Sci 
117(31):18869-18879. https://doi.org/10. 
1073/pnas.2002959117 


Vijayakumar S, Rahman PKSM, Angione C 
(2020) A hybrid flux balance analysis and 
machine learning pipeline _ elucidates 


101 


102. 


103. 


104 


105 


106. 


107. 


. Zhang J, 


Synbio x ML 39 


metabolic adaptation in cyanobacteria. 
iScience 23(12):101818. https://doi.org/ 
10.1016/j.isci.2020.101818 


Petersen SD, Radivojevic  T, 
Ramirez A, Pérez-Manriquez A, Abeliuk E 
et al (2020) Combining mechanistic and 
machine learning models for predictive engi- 
neering and optimization of tryptophan 
metabolism. Nat Commun 11(1):4880. 
https: //doi.org/10.1038/s41467-020- 
17910-1 

Heckmann D, Lloyd CJ, Mih N, Ha Y, Zie- 
linski DC, Haiman ZB et al (2018) Machine 
learning applied to enzyme turnover numbers 
reveals protein structural correlates and 
improves metabolic models. Nat Commun 
9(1):5252. https://doi.org/10.1038/ 
s41467-018-07652-6 

Roubos JA, van Straten G, van Boxtel AJB 
(1999) An evolutionary strategy for 
fed-batch bioreactor optimization; concepts 
and performance. J Biotechnol 67(2): 
173-187. https://doi.org/10.1016/S0168- 
1656(98)00174-6 


. Lara AR, Galindo E, Ramirez OT, Palomares 


LA (2006) Living with heterogeneities in 
bioreactors. Mol Biotechnol 34(3):355-381. 
https: //doi.org/10.1385/MB:34:3:355 


. Wu G, Yan Q, Jones JA, Tang YJ, Fong SS, 


Koffas MAG (2016) Metabolic burden: cor- 
nerstones in synthetic biology and metabolic 
engineering applications. Trends Biotechnol 
34(8):652-664. https: //doi.org/10.1016/j. 
tibtech.2016.02.010 


Lencastre Fernandes R, Nierychlo M, 
Lundin L, Pedersen AE, Puentes Tellez PE, 
Dutta A et al (2011) Experimental methods 
and modeling techniques for description of 
cell population heterogeneity. Biotechnol 
Adv 29(6):575-599. https://doi.org/10. 
1016/j.biotechadv.2011.03.007 
Otero-Muras I, Carbonell P (2021) Auto- 
mated engineering of synthetic metabolic 
pathways for efficient biomanufacturing. 
Metab Eng 63:61-80. https://doi.org/10. 
1016/j.ymben.2020.11.012 




Design and Analysis of Massively Parallel Reporter Assays 
Using FORECAST 


Pierre-Aurelien Gilliot and Thomas E. Gorochowski 


Abstract 


Machine learning is revolutionizing molecular biology and bioengineering by providing powerful insights 
and predictions. Massively parallel reporter assays (MPRAs) have emerged as a particularly valuable class of 
high-throughput technique to support such algorithms. MPRAs enable the simultaneous characterization 
of thousands or even millions of genetic constructs and provide the large amounts of data needed to train 
models. However, while the scale of this approach is impressive, the design of effective MPRA experiments 
is challenging due to the many factors that can be varied and the difficulty in predicting how these will 
impact the quality and quantity of data obtained. Here, we present a computational tool called FORECAST,
which can simulate MPRA experiments based on fluorescence-activated cell sorting and subsequent
sequencing (commonly referred to as Flow-seq or Sort-seq experiments), as well as carry out rigorous 
statistical estimation of construct performance from this type of experimental data. FORECAST can be used 
to develop workflows to aid the design of MPRA experiments and reanalyze existing MPRA data sets. 


Key words Massively parallel reporter assay, Cell sorting, Sequencing, Inference, Experimental 
design, Bioinformatics, Synthetic biology 


1 Introduction


In order to effectively engineer biology, it is necessary to under- 
stand the complex relationship between DNA sequence and 
function [1-7]. However, this relationship is notoriously hard to 
map out due to the high-dimensional structure of the underlying 
functional landscape. Data-driven approaches offer a means to 
address this challenge but require large amounts of training data 
that can be costly to generate experimentally [8]. To tackle this 
issue, new massively parallel reporter assays (MPRAs) have become 
popular, allowing for large-scale genotype-to-phenotype maps to 
be produced [9-18]. The most prominent MPRA methods are 
Flow-/Sort-seq-based techniques [19] where vast libraries of 
genetic constructs are designed to control the expression of a 
fluorescent reporter (normally a fluorescent protein). Large pools
of these genetic constructs are assembled and inserted into living
cells. Fluorescence-activated cell sorting and sequencing is then
used to estimate the phenotype (i.e., fluorescence level) of each
genetic construct (Fig. 1). This approach has enabled the parallel
measurement of up to 100 million genetic constructs at a
time [20].


Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology,
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_3,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


[Figure 1: schematic with the panels “Diverse genotypes” (combinatorial assembly, oligo synthesis), “Flow-/Sort-seq” (cells, fluorescence-activated cell sorting (FACS), DNA sequencing, genotype-phenotype map), and “Machine learning” (deep neural networks, etc.)]


Fig. 1 Overview of Flow-/Sort-seq-based massively parallel reporter assays (MPRAs). Diverse genetic
constructs (genotypes) can be generated in several different ways. Common examples include random pooled
combinatorial assembly and massively parallel oligo synthesis. Each genotype must have a fluorescence-
based phenotype when placed in a living cell. Flow-/Sort-seq then takes a mixed pool of cells transformed
with this genotype library and uses fluorescence-activated cell sorting (FACS) to separate cells with similar
fluorescence values into a discrete set of bins. Constructs in each bin are barcoded and then sequenced to
allow for the genotypes present in each bin to be recovered. This then provides the genotype-to-phenotype
data necessary for training machine learning models

Genomic regulation has been increasingly studied in this way with 
MPRA data being combined with machine learning algorithms 
(e.g., recurrent or convolutional neural networks [21-24]) to derive
predictive models that can be subsequently used for forward design 
[20, 25-29]. To further improve model performance, efforts have 
mostly been geared towards collecting larger amounts of data 
[12, 20] or developing more flexible machine learning algorithms 
that learn more quickly [30, 31]. However, an aspect that has been 
overlooked is the design of the MPRA experiments themselves and the 
quality of the data produced. Every MPRA experiment has many 
parameters in its design, and all of these can affect the accuracy of the 
inferred results. Therefore, there is a need for systematic approaches to 
explore MPRA experimental design space and for more rigorous data 
analysis methods to ensure suboptimal decisions are avoided. 

In this chapter, we present a computational tool called FORECAST
that aims to address this problem. FORECAST is a Python
package that can accurately simulate Flow-/Sort-seq-based MPRA 
experiments and provide maximum likelihood (ML) and method of 
moments (MoM)-based estimators for the accurate inference of 
construct performance from MPRA data. Here, we demonstrate 
how these features can be used to explore the design parameters of 
an MPRA experiment and ensure the robust analysis of MPRA- 
based data. 




2 Materials 


2.1 Software Dependencies


FORECAST requires that the following software tools and
packages are installed and accessible from the command line.


1. Python version 3.7 or later (we recommend a distribution like 
Anaconda). 


2. Git version 2.20 or later. 


3. Conda version 4.0 or later. 
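

As a quick sanity check before installation, the requirements listed above can be verified from Python. The following is a minimal sketch and not part of FORECAST: the function name check_dependencies is hypothetical, and it only confirms that the Python version is recent enough and that “git” and “conda” executables can be found on the PATH (it does not compare their version numbers against the minimums given above).

```python
import shutil
import sys


def check_dependencies(min_python=(3, 7), tools=("git", "conda")):
    """Return a dict mapping each requirement name to True (met) or False (not met).

    Hypothetical helper for illustration only; it is not shipped with FORECAST.
    """
    # Compare only the (major, minor) part of the running interpreter version
    status = {"python": sys.version_info[:2] >= tuple(min_python)}
    for tool in tools:
        # shutil.which returns None when the executable is not on the PATH
        status[tool] = shutil.which(tool) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If a tool is reported as missing, its version should then be checked by hand (e.g., with the tool's own “--version” option) after installation.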


2.2 Installation


1. A copy of the latest version of FORECAST can be downloaded
by running the command:


git clone https://gitlab.com/Pierre-Aurelien/forecast.git 


2. This will create a directory called “forecast” with the structure 
shown in Fig. 2. It is crucial that all commands are executed 
from within the root of this directory. To move to this direc- 
tory, use the command: 


cd forecast 


forecast (root install location)
├── LICENSE
├── README.md
├── forecast
│   ├── __init__.py
│   ├── run_simulation.py
│   ├── run_inference.py
│   ├── data
│   ├── figure
│   ├── investigation
│   ├── protocol
│   └── util
├── test
└┄┄ out


Fig. 2 Structure of the FORECAST code repository. Directories are shown in bold. 
Each directory in the “forecast” directory has a key function: “data” holds data 
samples used for simulation; “figure” contains scripts to produce output plots;
“investigation” holds scripts capturing analyses that are possible; “protocol” 
contains the steps to model the MPRA experiment during simulation; and “util” 
holds general utility functions. The “test” directory is used for testing functions, 
and the dashed “out” directory is created upon execution of a simulation or 
analysis script and used to hold directories containing the outputs from each of 
these 


3. We recommend creating a conda environment called “forecast”
to house the additional Python packages that are required. This
can be done by running the command:


conda env create -f environment.yml


4. This new environment must then be activated using:


conda activate forecast


5. Finally, all Python dependencies can be installed by running:


pip install -e .


6. FORECAST is now ready to be used (see Note 1).


3 Methods


FORECAST is a Python package that provides several command-
line tools to aid the simulation and analysis of Flow-/Sort-seq-
based MPRA data (herein referred to as MPRA data). In the
following sections, we outline each of the available commands
and the scope of their use when simulating MPRA experiments or
analyzing MPRA data.


3.1 Simulation of MPRA Experiments


FORECAST can generate biologically realistic MPRA data by simu- 
lating each of the steps in an MPRA experiment. This includes cell 
sorting, PCR amplification during sequencing, library preparation, 
and finally sequencing. To do so, it requires input files providing 
key parameters for the fluorescence output distributions of each 
construct in the library to be simulated. 
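

To make the simulated steps more concrete, the sketch below implements a heavily simplified Flow-/Sort-seq experiment using only the Python standard library. It is an illustration under stated assumptions and not FORECAST's implementation: the function names are hypothetical, every construct is assumed to be equally abundant, the bins are assumed to be evenly spaced on a log scale between 1 and the maximum fluorescence, reads are assumed to be split equally between bins, and PCR amplification is ignored.

```python
import bisect
import random


def log_spaced_upper_bounds(n_bins, f_max):
    """Upper fluorescence bound of each bin (assumption: log-even spacing from 1 to f_max)."""
    return [f_max ** ((j + 1) / n_bins) for j in range(n_bins)]


def simulate_flow_seq(library, n_cells, n_reads, upper_bounds, seed=0):
    """Toy simulation returning (cells sorted per bin, read counts per construct per bin).

    library maps a construct ID to the (shape, scale) of its gamma-distributed fluorescence.
    """
    rng = random.Random(seed)
    ids = sorted(library)
    n_bins = len(upper_bounds)

    # Cell sorting: draw a fluorescence value for each cell and assign it to a bin.
    cells = {cid: [0] * n_bins for cid in ids}
    for _ in range(n_cells):
        cid = rng.choice(ids)  # assumption: all constructs equally abundant
        shape, scale = library[cid]
        fluorescence = rng.gammavariate(shape, scale)
        # Values above the last bound are placed in the last bin
        j = min(bisect.bisect_left(upper_bounds, fluorescence), n_bins - 1)
        cells[cid][j] += 1
    cells_per_bin = [sum(cells[cid][j] for cid in ids) for j in range(n_bins)]

    # Sequencing: within each bin, reads are drawn in proportion to the sorted cells.
    reads = {cid: [0] * n_bins for cid in ids}
    reads_per_bin = n_reads // n_bins  # assumption: equal sequencing depth per bin
    for j in range(n_bins):
        if cells_per_bin[j] == 0:
            continue  # nothing was sorted into this bin, so nothing to sequence
        weights = [cells[cid][j] for cid in ids]
        for cid in rng.choices(ids, weights=weights, k=reads_per_bin):
            reads[cid][j] += 1
    return cells_per_bin, reads


# Two hypothetical constructs: a weak one (mean 100) and a strong one (mean 2000)
library = {"construct_1": (2.0, 50.0), "construct_2": (5.0, 400.0)}
bounds = log_spaced_upper_bounds(n_bins=8, f_max=1e5)
cells_per_bin, reads = simulate_flow_seq(library, n_cells=10_000, n_reads=80_000, upper_bounds=bounds)
print("cells per bin:", cells_per_bin)
for cid, counts in reads.items():
    print(cid, counts)
```

The two outputs mirror the kind of information an MPRA experiment provides: the number of cells sorted into each bin and a table of read counts per construct and bin. Varying the number of bins, cells, or reads in such a sketch gives a feel for how sparse the read table becomes as the experiment is scaled down.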


1. To provide the parameters for the output fluorescence distribution
of each construct in the library, a tab-delimited comma-separated
values (CSV) file is required. We allow for either
gamma or lognormal distributions to be used, with the shape
α and scale β or mean μ and standard deviation σ parameters
provided, respectively. Each row in the CSV file corresponds to
a separate construct (genotype) with the construct ID, α or μ,
and finally β or σ, depending upon the distribution type
(gamma or lognormal, respectively). If using a gamma distribution,
the file must be named “library_gamma.csv,” and if
using a lognormal distribution, the file must be named
“library_lognormal.csv.” We provide examples of both types of these
files derived from real MPRA experiments [12, 32] in the
“data/gamma” and “data/lognormal” directories. These are
used by default by all FORECAST commands unless a
user-specified library is provided.


2. Once the distributions for constructs in a library are available as
an appropriately formatted CSV file, simulation of an MPRA
experiment can be performed. This is carried out using the
command:


generate auto_bin


This will create a directory called “simulation_” followed
by the date and time in the “out” directory. Three CSV files are
created within this directory. The first, “cells_bins.csv,” is a
tab-delimited CSV file containing a single row, where each
entry represents the number of cells sorted per bin. The second,
“sequencing.csv,” is also a tab-delimited table where each row
corresponds to a different genetic construct, and each column
indicates the read counts per bin. The final file,
“metadata_simulation.csv,” is a tab-delimited table detailing the parameters used
to conduct the simulation. Examples of these files can be found
in the “data/flow_seq” directory.


3. By default, the “generate” command will use the data and
options specified in Table 1. These can be individually altered
to enable user-defined simulations. For example, the following
command:


generate auto_bin --bins 8 --reads 1e8


will simulate an MPRA experiment using the default gamma-
distributed construct library with 8 log-spaced bins and
100 million sequencing reads.


3.2 Inferring Construct Performance from MPRA Data

Once MPRA data has been generated computationally (as in Sub- 
heading 3.1) or measured from experiment, it is necessary to infer 
the performance (i.e., average fluorescence) of each genetic con- 
struct. FORECAST can generate estimates using both maximum 
likelihood (ML) and method of moments (MoM) approaches 
assuming either a gamma (the default) or lognormal distribution
for the underlying data. 
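

The difference between the two estimator types can be illustrated with a small standard-library sketch. It is not FORECAST's code: the function names are hypothetical, the lognormal case is used only because its cumulative distribution can be written with math.erf, the counts are assumed to be cell counts per bin for a single construct, the first bin is assumed to start at a fluorescence of 1, and the last bin is treated as open-ended. The MoM estimate places every cell at the log-midpoint of its bin, whereas the ML estimate maximizes the binned likelihood, here with a coarse grid search around the MoM values.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mom_lognormal(counts, upper_bounds, lower_bound=1.0):
    """MoM sketch: mean and s.d. of log-fluorescence using bin log-midpoints."""
    lowers = [lower_bound] + list(upper_bounds[:-1])
    mids = [0.5 * (math.log(lo) + math.log(up)) for lo, up in zip(lowers, upper_bounds)]
    total = sum(counts)
    mu = sum(n * m for n, m in zip(counts, mids)) / total
    var = sum(n * (m - mu) ** 2 for n, m in zip(counts, mids)) / total
    return mu, math.sqrt(var)


def log_likelihood(mu, sigma, counts, upper_bounds):
    """Binned log-likelihood; each bin contributes count * log(probability of the bin)."""
    # Assumption: the last bin collects everything above the second-to-last bound
    cdf = [normal_cdf((math.log(u) - mu) / sigma) for u in upper_bounds[:-1]] + [1.0]
    total, previous = 0.0, 0.0
    for n, c in zip(counts, cdf):
        if n > 0:
            total += n * math.log(max(c - previous, 1e-300))
        previous = c
    return total


def ml_lognormal(counts, upper_bounds, steps=40):
    """ML sketch: coarse grid search for (mu, sigma) started from the MoM estimate."""
    mu0, sigma0 = mom_lognormal(counts, upper_bounds)
    sigma0 = max(sigma0, 0.1)  # avoid a degenerate grid when all cells share one bin
    best = (log_likelihood(mu0, sigma0, counts, upper_bounds), mu0, sigma0)
    for i in range(-steps, steps + 1):
        mu = mu0 + 2.0 * i / steps
        for k in range(1, 2 * steps + 1):
            sigma = 3.0 * sigma0 * k / (2 * steps)
            ll = log_likelihood(mu, sigma, counts, upper_bounds)
            if ll > best[0]:
                best = (ll, mu, sigma)
    return best[1], best[2]


counts = [5, 80, 15, 0]               # hypothetical cells per bin for one construct
upper_bounds = [1e2, 1e3, 1e4, 2e4]   # hypothetical upper fluorescence bound of each bin
print("MoM (mu, sigma):", mom_lognormal(counts, upper_bounds))
print("ML  (mu, sigma):", ml_lognormal(counts, upper_bounds))
```

In this toy example the MoM estimate of the spread is inflated by the wide first bin, while the ML estimate accounts for the bin boundaries directly. A production implementation would use a proper numerical optimizer rather than a grid and would also convert read counts into estimated cell counts before fitting.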


1. To analyze MPRA data, it must be in a compatible format.
Specifically, FORECAST requires a directory containing two
CSV files: (1) “cells_bins.csv” that contains a tab-delimited list
(m elements long) of the number of cells sorted into each of the
m bins and (2) “sequencing.csv” that contains rows
corresponding to each construct (unique genotype) and columns
where the first denotes the “ID” of the construct and the
following m columns contain the number of reads recovered for
that construct in each bin after cell sorting. The format of these
files is identical to those generated by the simulator described
in Subheading 3.1, allowing simulated data to be immediately
analyzed.




Table 1
Command-line options for the generate command


Flag                   Type     Default                  Description
--distribution         String   gamma                    Fluorescence distribution: “gamma” or “lognormal”
[illegible]            Float    1e6                      Number of cells sorted
--reads                Float    1e5                      Total number of sequencing reads
--ratio_amplification  Float    1e2                      PCR amplification ratio
--bias_library         Boolean  FALSE                    Allow for a different number of cells for each construct
--metadata_path        Path     forecast/data/gamma      Path to construct distributions
[illegible]            Path     out/simulation_DATETIME  Path for output files
[illegible]            Integer  1                        Fluorescence per protein

auto_bin*
[illegible]            Float    1e5                      Maximum measurable fluorescence value
--bins                 Integer  12                       Number of logarithmically spaced bins used for sorting

custom_bin*
--upper_bounds         List     [illegible]              Upper fluorescence bounds of each bin. The first upper fluorescence bound must be greater than 1 a.u.


*The auto_bin and custom_bin options are mutually exclusive and should be specified directly after the command
keyword (e.g., “generate auto_bin”). Flags specific to their operation are provided below each option in the table


2. Inference can then be performed by running:


infer auto_bin --metadata_path DATA_PATH


Here, “DATA_PATH” is the path to the input files
described in the previous step, and if no “--metadata_path”
option is provided, the latest simulation output from the “out”
directory will be used. This command will create a directory
called “inference_” followed by the date and time in the “out”
directory. This new directory will contain four CSV files: (1) a
copy of the input “cells_bins.csv” file; (2) a copy of the
“sequencing.csv” file; (3) a tab-delimited file named “results.csv”
holding the results from the inference step (each row in
“results.csv” corresponds to a single genetic construct;
descriptions of each column are provided in Table 2); and (4) a
tab-delimited table named “metadata_inference.csv” filled with
all parameters used to conduct the inference. It should be
noted that this inference step is computationally expensive
due to the likelihood calculations (see Note 2).


Table 2
Column descriptions for result files generated by the infer command


Column name      Description
a_mle            Maximum likelihood (ML) estimate of the shape parameter
b_mle            ML estimate of the scale parameter
a_se             ML estimate of the standard error for the shape parameter
b_se             ML estimate of the standard error for the scale parameter
a_mom            Method of moments (MoM) estimate of the shape parameter
b_mom            MoM estimate of the scale parameter
mean             Fluorescence mean. Log fluorescence mean if using the lognormal distribution
st_dev           Fluorescence standard deviation. Log fluorescence standard deviation if using the lognormal distribution
inference_grade  Quality of ML inference (lower is better): (1) ML is successful; (2) ML possible, but standard errors cannot be derived as the observed Fisher information matrix is not invertible; (3) the construct has only been sequenced in one bin, so ML is not useful and only MoM inference is conducted; (4) no inference possible, construct has not been sequenced at all
score            Percentage of reads at the first or last bin (lower is better)
mu_mle           ML estimate of the lognormal μ parameter
sigma_mle        ML estimate of the lognormal σ parameter
mu_se            ML estimate of the standard error for the lognormal μ parameter
sigma_se         ML estimate of the standard error for the lognormal σ parameter


3. The “infer” command has many options to tailor how the
inference step is performed. These options and their default
values are described in Table 3. For example, the following
command:


infer custom_bin --upper_bounds 1e2 1e3 1e4 2e4 --last_index all --metadata_path DATA_PATH


will conduct inference on all constructs in the data located in
DATA_PATH and will assume that the upper fluorescence bounds
of the bins are 100, 1000, 10,000, and 20,000, respectively.
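The two binning modes can be made concrete with a short sketch. It assumes that automatically generated bins are spaced evenly on a log10 scale between 1 a.u. and the maximum measurable fluorescence, and that custom bounds are listed in increasing order. Both are illustrative assumptions and not a description of FORECAST's internals, and the function names are hypothetical.

```python
import math


def log_spaced_upper_bounds(f_max=1e5, bins=12, f_min=1.0):
    """Upper bounds of `bins` bins spaced evenly on a log10 scale.

    Assumption: the lowest bin starts at f_min = 1 a.u.; the exact
    scheme used inside FORECAST may differ.
    """
    log_min = math.log10(f_min)
    step = (math.log10(f_max) - log_min) / bins
    return [10 ** (log_min + step * (i + 1)) for i in range(bins)]


def check_custom_bounds(upper_bounds):
    """Reject bound lists that a custom binning could not use."""
    if upper_bounds[0] <= 1.0:
        raise ValueError("first upper bound must be greater than 1 a.u.")
    # Assumption: bounds are listed from the lowest bin to the highest.
    if any(b <= a for a, b in zip(upper_bounds, upper_bounds[1:])):
        raise ValueError("upper bounds must be strictly increasing")
    return list(upper_bounds)


auto_bounds = log_spaced_upper_bounds(f_max=1e5, bins=12)
custom_bounds = check_custom_bounds([1e2, 1e3, 1e4, 2e4])
print([round(b, 1) for b in auto_bounds])
print(custom_bounds)
```

Under these assumptions, the custom bounds from the command above pass the check because the first bound (100) exceeds 1 a.u.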




Table 3
Command-line options for the infer command

Flag              Type     Default                  Description
--distribution    String   gamma                    Fluorescence distribution: “gamma” or “lognormal”
--metadata_path   Path     Latest simulation        Path to MPRA data
--out_path        Path     out/simulation_DATETIME  Path for output files
--first_index     Integer  0                        Starting index for inference
--last_index      String   sample                   Final index for inference: “sample” (only 100
                                                    constructs) or “all” (all constructs)
--num_workers     Integer  -1 (all CPUs)            Number of parallel processes to use
--verbose         Integer  1                        Display progress messages: 1 = TRUE, 0 = FALSE

auto_bin*
--f_max           Float    1e5                      Maximum measurable fluorescence value

custom_bin*
--upper_bounds    List     1e2 1e3 1e4              Upper fluorescence bounds of each bin. First upper
                                                    fluorescence bound must be greater than 1 a.u.

*The auto_bin and custom_bin options are mutually exclusive and should be specified directly after the plot
keyword. Flags specific to their operation are provided below each option in the table


3.3 Optimizing the Design of MPRA Experiments

When designing an MPRA experiment, it is useful to be able to
assess how various choices such as sequencing depth (i.e., total
number of sequencing reads), number of bins used during cell
sorting, etc. affect the accuracy of the inferred fluorescence
distributions for each construct. Typically, this would be done
through experimental trial and error. To avoid this, FORECAST
provides the facility to simulate factorial designs where several
design parameters are varied between a set of discrete values in
all possible ways and simulations used to assess their effect.
This allows for the exploration of the experimental design space
and can be used to pick parameters that ensure the most accurate
inference of construct performance from experimental data.


1. A full factorial combination of parameters for both simulation 
and analysis of an MPRA experiment can be performed by 
using the command 


factorial auto_bin 




This will create a directory called “factorial_design_” fol- 
lowed by the date and time in the “out” directory. For each 
simulation, three CSV files are created as described in Subhead- 
ing 3.2. For each file, the parameters are indicated in brackets at 
the end of the filename with the following order of parameters: 
“fmax” maximum measurable fluorescence level; “distribu- 
tion” statistical distribution underlying the fluorescence 
observed for each construct; “diversity” total number of differ- 
ent genetic constructs assessed; “bias” a Boolean indicating if 
all genetic designs in the library are sampled equally; “rep” the 
replicate number; “seq” total number of sequencing reads; 
“bins” number of bins used for cell sorting; “size” total num- 
ber of cells sorted; “pcr_amp” amplification ratio during the 
PCR step; and “f_amp” factor capturing fluorescence per pro- 
tein. By default, this command will only run the simulation and 
inference steps for the default parameters shown in Table 4. 


2. To specify a user-defined set of experimental parameters to use, 
command-line flags followed by space-delimited values can be 
used. For example, the following command: 


factorial auto_bin --reads 1e3 1e4 --bins 12 16 22 


will perform a full factorial analysis where experiments will be
simulated for all combinations of 1 × 10³ and 1 × 10⁴ total
sequencing reads and 12, 16, and 22 bins for cell sorting. The
full set of available flags is described in Table 4.
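To illustrate what "all combinations" means for the example above, the following sketch enumerates a full factorial design with the Python standard library. The helper name and the dictionary keys are hypothetical and only mirror the flag names; this is not how FORECAST itself builds its designs.

```python
from itertools import product


def full_factorial(levels):
    """Return every combination of parameter levels as a list of dicts."""
    names = list(levels)
    combos = product(*(levels[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Two read depths and three bin counts give 2 x 3 = 6 simulated experiments.
designs = full_factorial({"reads": [1e3, 1e4], "bins": [12, 16, 22]})
for design in designs:
    print(design)
```

Because the number of designs is the product of the number of levels for each parameter, adding further flags or replicates quickly increases the total run time.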


3.4 Assessing the Accuracy of the Inferred Distributions

Several tools are provided to help visualize the impact of both
experimental design and inference methods on the accuracy of the
estimates.


1. When performing inference on MPRA data, the output files
(i.e., “results.csv”) include additional information about the
quality of the inference step that can be useful in filtering out
those constructs where there is a large amount of uncertainty in
the inferred fluorescence distributions. Specifically, each
row contains an “inference_grade” value. Values >1 indicate
that assumptions of the mathematical methods were violated or
that a lack of data makes the inferred distributions inaccurate
(see Table 2 for a full description of all inference grades).
This value can be used to manually retain only those constructs
with accurate inference (i.e., “inference_grade” = 1), if required.
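A filter of this kind could be sketched as follows, using only the Python standard library. The sketch assumes that “results.csv” is tab-delimited with a header row containing the column names from Table 2; the rows shown are made-up illustrative values, and the function name is hypothetical.

```python
import csv
import io


def keep_reliable(handle, max_grade=1):
    """Return rows whose inference_grade is at most max_grade."""
    reader = csv.DictReader(handle, delimiter="\t")
    # float() first so that grades written as "1.0" are also accepted.
    return [row for row in reader
            if int(float(row["inference_grade"])) <= max_grade]


# Illustrative stand-in for a results.csv file (values are invented).
example = io.StringIO(
    "a_mle\tb_mle\tinference_grade\tscore\n"
    "2.1\t350.0\t1\t0.5\n"
    "4.0\t120.0\t3\t60.0\n"
    "1.7\t900.0\t2\t2.0\n"
)
reliable = keep_reliable(example)
print(len(reliable), "construct(s) retained")
```

For a real run, the `io.StringIO` object would be replaced by an opened file, for example `open(path, newline="")`, where `path` points to the “results.csv” file in the inference output directory.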

2. In addition, when working with simulated MPRA data and 
wanting to compare estimates of parameters for each inference 
method, the following command can be used 


plot_ci 


Table 4
Command-line options for the factorial command

Flag                   Type     Default                  Description
--rep                  Integer  1                        Number of replicates
--f_max                Float    1e5                      Maximum measurable fluorescence value
--distribution         String   gamma                    Fluorescence distribution: “gamma” or “lognormal”
--bins                 Integer  12                       Number of log-spaced bins used for sorting
--size                 Float    1e6                      Number of cells sorted
--reads                Float    1e5                      Total number of sequencing reads
--ratio_amplification  Float    1e2                      PCR amplification ratio
--bias_library         Boolean  FALSE                    Use a different number of cells for each construct
--metadata_path        Path     forecast/data/gamma      Path to construct distributions
--out_path             Path     out/simulation_DATETIME  Path for output files
--f_amp                Integer  1                        Fluorescence per protein
--first_index          Integer  0                        Starting index for inference
--last_index           String   sample                   Final index for inference: “sample” (only 100
                                                         constructs) or “all” (all constructs)
--num_workers          Integer  -1                       Number of parallel processes to use
                                                         (-1 = all physical CPU cores)
--verbose              Integer  1                        Display progress messages: 1 = TRUE, 0 = FALSE


This will create a directory called “figure_CI_” followed by
the date and time in the “out” directory. This directory will
contain two files (plots), one for each distribution parameter
(either a and b for gamma distributed data or μ and σ for
lognormal distributed data) showing the estimated parameters
inferred by the ML and MoM methods, as well as the ground
truth value. An example of the output is shown in Fig. 3.
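For readers unfamiliar with the MoM estimator, the sketch below shows the standard method of moments for a gamma distribution, where the shape is a = mean²/variance and the scale is b = variance/mean. It works on raw fluorescence samples for simplicity, whereas FORECAST infers parameters from binned sequencing data, so it illustrates the principle only. The function name is hypothetical.

```python
import random
import statistics


def gamma_mom(samples):
    """Method-of-moments estimates (shape a, scale b) for gamma data."""
    mean = statistics.fmean(samples)
    var = statistics.variance(samples)  # unbiased sample variance
    shape = mean ** 2 / var
    scale = var / mean
    return shape, scale


# Simulated ground truth: shape a = 5, scale b = 200 (arbitrary values).
rng = random.Random(0)
samples = [rng.gammavariate(5.0, 200.0) for _ in range(20000)]
a_mom, b_mom = gamma_mom(samples)
print(f"a_mom = {a_mom:.2f}, b_mom = {b_mom:.1f}")
```

With many samples the estimates should lie close to the ground truth, which is the comparison that the plots produced by plot_ci make visible for each construct.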


[Plot panels (a) and (b): parameter estimates for each construct; legend: ground truth, MoM estimator, ML estimator]

Fig. 3 Example of plot_ci command output. (a) Comparison of estimates for the a parameter for an
underlying gamma distribution. (b) Comparison of estimates for the b parameter for an underlying gamma
distribution. Each point denotes an individual design with error bars for the maximum likelihood estimator
showing the 99% confidence interval. MoM method of moments, ML maximum likelihood


3. By default, the script will plot the results from the latest
simulated data in the “out” directory. This can be changed
by providing flags such as:


plot_ci --library_path LIBRARY_PATH --metadata_path DATA_PATH


Here, “LIBRARY_PATH” is the path to the construct
library that contains all the ground truth values for each
construct (e.g., location of the “library_gamma.csv” or
“library_lognormal.csv” files), and “DATA_PATH” is the path to
simulation data from this construct library.
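The error bars in Fig. 3 show 99% confidence intervals for the ML estimates. One common way to obtain such an interval from an estimate and its standard error (e.g., the “a_mle” and “a_se” columns of Table 2) is the normal approximation sketched below. Whether FORECAST computes its intervals in exactly this way is an assumption, and the function name and numbers are illustrative.

```python
from statistics import NormalDist


def wald_interval(estimate, std_error, level=0.99):
    """Normal-approximation confidence interval: estimate +/- z * SE."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 2.576 for 99%
    return estimate - z * std_error, estimate + z * std_error


# Illustrative values standing in for a_mle and a_se of one construct.
lower, upper = wald_interval(10.0, 1.0)
print(f"99% CI: [{lower:.2f}, {upper:.2f}]")
```

A ground truth value that falls outside this interval for many constructs would suggest that the standard errors are too optimistic for the chosen experimental design.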




3.5 Visually Comparing Individual Inferred Distributions

It can be useful to visually compare the inferred fluorescence
distributions for individual constructs to the underlying
sequencing read histograms and potentially the ground truth if the
data is simulated. This allows for the accuracy of the inference
to be assessed on a per construct basis and further analysis of
those constructs that may not adhere to the expected shape of the
fluorescence distribution.


1. For comparisons between the ML- and MoM-inferred
distributions, as well as the read depth histogram, the following
command can be used:


plot_pdf auto_bin 


This will create a directory called “figure_pdf_” followed 
by the date and time in the “out” directory. An example of the 
plot produced is shown in Fig. 4. By default (without any 
arguments), the script will plot the results for the latest 
simulated MPRA data in the “out” directory for the first con- 
struct in the library. 


[Plot: estimated number of cells per bin versus fluorescence (a.u.); legend: read count, ML inference, MoM inference]

Fig. 4 Example of plot_pdf command output. The histogram denotes the number of cells per bin, while the
two lines indicate the fluorescence distributions inferred from it. MoM method of moments, ML maximum
likelihood, pdf probability distribution function, a.u. arbitrary units


[Plot: number of cells per bin versus fluorescence (a.u.); legend: read count, ML inference, MoM inference, ground truth]

Fig. 5 Example of plot_all command output. The histogram denotes the number of cells per bin, while the
two continuous lines indicate the fluorescence distributions inferred from it and the dashed black line the
underlying ground truth distribution used to generate the data. MoM method of moments, ML maximum
likelihood, pdf probability distribution function, a.u. arbitrary units


2. For simulated data, the following command allows for the 
ground truth to be additionally plotted on the figure 


plot_all auto_bin 


Again, this will create a directory called “figure_pdf_” fol- 
lowed by the date and time in the “out” directory. An example 
of this type of plot is shown in Fig. 5. By default, the script will 
plot the results for the latest simulated MPRA data in the “out” 
directory for the first construct in the library. 


3. Both these commands can be provided with optional flags 
(detailed in Table 5) that allow for the plot to be customized. 
For example, the following command will create a comparison 
plot for construct number 13 from the last generated 
simulation data: 


plot_pdf auto_bin --construct 13 
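The link between an inferred distribution and the per-bin histogram in these plots can be sketched for the lognormal case, where the fraction of cells expected in a bin is the difference of the cumulative distribution function at its two bounds. The sketch assumes that μ and σ are expressed on the natural-log scale and that the lowest bin starts at zero fluorescence. These are illustrative assumptions, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def lognormal_bin_fractions(mu, sigma, upper_bounds, lower=0.0):
    """Expected fraction of cells in each bin for lognormal fluorescence."""
    std_normal = NormalDist()

    def cdf(x):
        # Lognormal CDF; zero for non-positive fluorescence.
        if x <= 0:
            return 0.0
        return std_normal.cdf((math.log(x) - mu) / sigma)

    fractions = []
    previous = lower
    for upper in upper_bounds:
        fractions.append(cdf(upper) - cdf(previous))
        previous = upper
    return fractions


# Illustrative parameters: median fluorescence of 1000 a.u., sigma = 1.
bounds = [1e2, 1e3, 1e4, 2e4]
fractions = lognormal_bin_fractions(math.log(1000.0), 1.0, bounds)
for upper, fraction in zip(bounds, fractions):
    print(f"bin up to {upper:g} a.u.: {fraction:.3f}")
```

Multiplying these fractions by the number of sorted cells for a construct gives expected per-bin counts that can be compared with the histogram. Any remaining probability mass lies above the last upper bound.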




Table 5
Command-line options for the plot_pdf and plot_all commands

Flag              Type     Default              Description
--distribution    String   gamma                Fluorescence distribution: “gamma” or “lognormal”
--construct       Integer  0                    Construct index to plot
--metadata_path   Path     Latest simulation    Path to MPRA data
--library_path    Path     forecast/data/gamma  Path to construct distributions (only for plot_all)
--legend_loc      String   right                Location for the legend: “left” or “right”

auto_bin*
--f_max           Float    1e5                  Maximum measurable fluorescence value
--bins            Integer  12                   Number of log-spaced bins used for sorting

custom_bin*
--upper_bounds    List     1e2 1e3 1e4          Upper fluorescence bounds of each bin. First upper
                                                fluorescence bound must be greater than 1 a.u.

*The auto_bin and custom_bin options are mutually exclusive and should be specified directly after the plot
keyword. Flags specific to their operation are provided below each option in the table


4 Notes


1. FORECAST has been tested on UNIX-based operating systems
(e.g., Linux and macOS). For use on Windows, it is advised to
use the Windows Subsystem for Linux (WSL) to run the software.
When using FORECAST, it is essential that the “forecast” conda
environment is activated (see Subheading 3.1, step 4), and all
commands should be run from the root directory of the tool.


2. Inferring construct performance from the MPRA data is
computationally expensive. FORECAST parallelizes this step to
help reduce execution time when a multi-core CPU is available.
For large factorial designs or libraries containing millions of
constructs, we recommend considering the use of a compute
cluster with many physical CPUs to speed up execution.


Acknowledgments


This work was supported by BrisSynBio, a BBSRC/EPSRC Synthetic
Biology Research Centre grant BB/L01386X/1 (T.E.G.), an EPSRC/
BBSRC Centre for Doctoral Training in Synthetic Biology grant
EP/L016494/1 (P.-A.G.), a Turing Fellowship from The Alan Turing
Institute under the EPSRC grant EP/N510129/1 (T.E.G.), and a
Royal Society University Research Fellowship grant UF160357 (T.E.G.).






Modeling Protein Complexes and Molecular Assemblies 
Using Computational Methods 


Romain Launay, Elin Teppa, Jeremy Esque, and Isabelle Andre 


Abstract 


Many biological molecules are assembled into supramolecular complexes that are necessary to perform 
functions in the cell. Better understanding and characterization of these molecular assemblies are thus 
essential to further elucidate molecular mechanisms and key protein-protein interactions that could be 
targeted to modulate the protein binding affinity or develop new binders. Experimental access to structural 
information on these supramolecular assemblies is often hampered by the size of these systems that make 
their recombinant production and characterization rather difficult. Computational methods combining 
both structural data, molecular modeling techniques, and sequence coevolution information can thus offer 
a good alternative to gain access to the structural organization of protein complexes and assemblies. Herein, 
we present some computational methods to predict structural models of the protein partners, to search for 
interacting regions using coevolution information, and to build molecular assemblies. The approach is 
exemplified using a case study to model the succinate-quinone oxidoreductase heterocomplex. 


Key words Protein-protein interaction, PPI, Molecular assembly, Protein structure prediction, 
Protein-protein docking, Sequence coevolution 


1 Introduction


Protein-Protein Interactions (PPIs) play an important role in the 
functioning of living cells, including cell-to-cell interactions and 
metabolic and developmental control [1, 2]. Most cellular 
functions are mediated by the assembly of proteins as more than 
80% of the proteins operate in vivo in the form of homo- or hetero- 
oligomers [3] whose constituents assemble/disassemble dynami- 
cally [4]. Interaction between the proteins can be permanent or 
transient. While permanent interactions will form a stable protein 


Romain Launay and Elin Teppa contributed equally with all other contributors. 


Supplementary Information The online version contains supplementary material available at [https://doi.org/ 
10.1007 /978-1-0716-2617-7_4]. 


Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology, 
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_4, 
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023 




complex, the transient interactions are rather involved in signaling 
and regulation pathways or substrate /metabolite channeling [2, 5, 
6]. Better understanding these molecular assemblies and PPIs is 
thus of major importance to further elucidate molecular mechan- 
isms of cellular processes, engineer synthetic metabolic pathways 
for synthetic biology, or identify drug targets for biomedical 
applications [5]. 

PPIs can be investigated at different levels. In vivo, yeast
two-hybrid and three-hybrid (Y2H, Y3H) techniques enable the
detection of protein interactions, while in vitro, a variety of
methods can be used such
as tandem affinity purification, affinity chromatography, coimmu- 
noprecipitation, protein arrays, protein fragment complementa- 
tion, phage display, and mass spectrometry [6-8] among others. 
At the structural level, investigation of PPIs has largely benefited 
from the growing number of protein-protein complexes solved in 
recent years using different biophysical techniques, such as X-ray 
crystallography, nuclear magnetic resonance spectroscopy, and 
cryo-electron microscopy [7]. To complete this arsenal of 
approaches, in silico molecular modeling based on a combination 
of template-based methods and docking approaches that can inte- 
grate experimental restraints (i.e., coevolution information) has 
also emerged as a powerful technique to investigate protein assem- 
blies, in particular when experimental data are lacking [3, 9]. 

In this chapter, we provide a brief introduction to computational
methods that can be used to predict structural models of proteins,
to search for interacting regions using inter-protein coevolution
information, and to model and analyze molecular assemblies. The
use of some of these methods and tools is illustrated by modeling
the succinate-quinone oxidoreductase heterocomplex as a
case study.


2 Methods for Building a 3D Model of a Protein 


Predicting the three-dimensional structure of a protein based on its 
sequence is still an open problem in research. Protein structure 
prediction methods on the basis of protein sequences are based 
on two principles: (i) protein structure is more conserved across 
evolution than protein sequence, and (ii) there is a finite and 
relatively small (less than 10,000) number of unique protein folds 
in Nature [10]. 

Structure prediction methods are broadly classified into two 
categories: (a) template-based modeling (which uses one or several 
known structure(s) as template(s)) and (b) template-free modeling 
(which predicts a protein structure without using a significant 
template). There are also hybrid approaches that combine the two 
kinds of methods. 


New modeling methods or corrections to existing methods
continually emerge. There are several ways to keep up with the
best existing methods, identify the progress over time, and
recognize where future efforts may be most productively focused.
One way is to be aware of CASP results (the Critical Assessment of
Protein Structure Prediction, www.predictioncenter.org)
conducted every 2 years since 1994. Another way is to check the
Continuous Automated Model EvaluatiOn (CAMEO;
www.cameo3d.org) project that provides weekly follow-ups for three
different aspects of the prediction by web servers: (a) homology
modeling, (b) model quality estimation, and (c) contact prediction.

In recent years, machine learning approaches have contributed
tremendously to improve the accuracy of structural prediction,
even when no similar structure is known [11]. Particularly in the
recent CASP14, the AlphaFold2 method [11] outperformed most
methods by predicting structures with high accuracy.


2.1 Template-Based Methods

The methods referred to as template-based modeling include 
threading techniques and comparative modeling. Template-based 
modeling predicts the 3D structure of a query protein through the 
sequence alignment between the query and one or several proteins 
with known structures. When query and template sequences have 
been derived from a common ancestor, the method is referred to as 
homology modeling. However, proteins from different evolution- 
ary origins may still adopt a similar structure; in this case, threading 
methods are used to identify structural templates. 

Generally, the process of comparative modeling involves four
steps: (a) template identification, (b) sequence alignment,
(c) model building, and (d) model refinement and validation. If
the model is not satisfactory, some or all of the steps can be
repeated. As such, the success of homology modeling depends on
the ability to identify closely homologous templates based on
sequence identity and to generate an accurate query-template
alignment. The goal of the alignment is to map the
one-dimensional target sequence correctly onto the corresponding
three-dimensional positions of the template structure, ideally
with only substitutions and small insertions/deletions. Broadly
speaking, comparative modeling produces a good result if the
query-template alignment has a global sequence identity ≥30%.
As the sequence identity decreases, correct template
identification becomes more difficult and the alignment becomes
more prone to misaligned regions. When the query-template
sequence identity is between 20% and 30%, the pair falls in the
twilight zone, where the evolutionary relatedness of proteins
becomes uncertain [12, 13]. In this case, the threading technique
may help to identify remote homology, leaving the ab initio
method as the last alternative for protein structure prediction.
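The identity thresholds above can be illustrated with a small sketch that computes percent identity from a pairwise alignment and maps it to the modeling regimes described in the text. Identity is computed here over aligned, non-gap columns, which is only one of several conventions in use. The function names are hypothetical, and the suggestion for identities below 20% is an extrapolation from the text and not a rule stated in it.

```python
def percent_identity(aligned_query, aligned_template, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned = matches = 0
    for q, t in zip(aligned_query, aligned_template):
        if q == gap or t == gap:
            continue  # skip insertions/deletions
        aligned += 1
        matches += q == t
    if aligned == 0:
        raise ValueError("no aligned positions")
    return 100.0 * matches / aligned


def modeling_regime(identity):
    """Map percent identity to the regimes discussed in the text."""
    if identity >= 30.0:
        return "comparative modeling"
    if identity >= 20.0:
        return "twilight zone: consider threading"
    # Assumption: below the twilight zone, threading is tried first.
    return "threading, then ab initio"


identity = percent_identity("ACDE-FG", "ACDQKFG")
print(f"{identity:.1f}% identity -> {modeling_regime(identity)}")
```

In practice, the alignment would come from a dedicated alignment tool, and alignment coverage should be checked alongside identity before a template is accepted.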


2.2 Template-Free or Ab Initio Methods

For query proteins that have no structurally related protein in the
PDB library, the structure must be built from scratch. This
procedure is called ab initio modeling, de novo modeling, or
template-free modeling. An ab initio method conducts an exhaustive
search to identify the minimum energy conformation through
optimization algorithms, such as Monte Carlo [14] or molecular
dynamics [15], using knowledge-based scoring or physics-based
energy functions. This procedure generates several putative
conformations (also called decoys), and final models are selected
from them. A successful ab initio modeling depends on three factors:


(a) An accurate energy function that scores the native structure of
a protein as being the most thermodynamically stable state,
compared to all possible decoy structures


(b) An efficient search method that can quickly identify the
low-energy states through conformational search


(c) A strategy that can select near-native models from a pool of
decoy structures


2.3 Servers for Protein Structure Prediction and Related Databases

Hereafter are presented some servers and databases used for protein
structure prediction based on various strategies and using, in some
cases, sequence coevolution information and artificial
intelligence-derived methods.


2.3.1 MODELLER via ModWeb and ModBase



MODELLER is one of the most widespread comparative modeling 
methods for prediction of protein structures [16]. Models are 
obtained by satisfying spatial restraints derived from the query- 
template alignment. 

These restraints include:


(a) Cα-Cα and backbone N-O distance restraints and dihedral angle
restraints


(b) Stereochemical restraints from the CHARMM-22 force field


(c) Statistical preferences for dihedral angles and non-bonded
inter-atomic distances derived from representative sets of
known protein structures


Additional restraints can also be added manually.
MODELLER is available free of charge only to academic nonprofit
institutions at https://salilab.org/modeller/.

Several servers based on MODELLER have been developed,
such as ModWeb and ModBase.

The ModWeb server (https://modbase.compbio.ucsf.edu/modweb/)
offers the possibility to use MODELLER online.

ModBase (http://salilab.org/modbase) is a database containing fold
assignments, sequence-structure alignments, models, and model
assessments for all sequences related to a known structure [17]. The
models are derived by ModPipe, an automated modeling pipeline
relying on the programs PSI-BLAST [18] and MODELLER. ModBase
also includes binding site predictions for small ligands and a set of
predicted interactions between pairs of modeled sequences from the
same genome.


2.3.2 PHYRE2


PHYRE2 (http://www.sbg.bio.ic.ac.uk/phyre2) is designed to
predict the three-dimensional structure of a protein from its
sequence [19]. The server uses a powerful strategy to detect remote
homology, combining PSI-BLAST alignment with hidden Markov
models (HMMs) via HHsearch for template detection. The primary
algorithmic strategy is composed of four steps. In the first step,
homologous sequences of the query are searched using HHblits,
and the resulting alignment is used to predict secondary structure.
In the second step, HHsearch is run against a database of HMMs of
proteins of known structure, and the top-scoring alignments are
used to construct the backbone of the protein model. In the third
step, the loops are modeled, and in the last step, the side chains are
added to generate the final model. In intensive mode, an additional
step uses an ab initio folding simulation called “Poing” to model
regions of the query protein with no detectable homology to known
structures.


2.3.3 I-TASSER


I-TASSER (Iterative Threading ASSEmbly Refinement) is a
hierarchical approach to predicting protein structure and function
from amino acid sequences [20]. I-TASSER is accessible via a web
server (https://zhanglab.dcmb.med.umich.edu/I-TASSER) and a
stand-alone package. Starting from an amino acid sequence, the
algorithm first tries to retrieve protein templates of similar fold from
the Protein Data Bank (PDB: https://www.rcsb.org) using a
meta-threading approach called LOMETS
(https://zhanggroup.org/LOMETS/). In the next step, the continuous
fragments taken from the PDB templates are reassembled into
full-length models. For cases where no appropriate template is
identified, I-TASSER builds the whole structure by ab initio
modeling. SPICKER identifies the low free-energy states by
clustering the simulation decoys (https://zhanggroup.org/SPICKER/).
In the third step, a second round of fragment assembly simulation is
performed to remove steric clashes and refine the global topology of
the cluster centroids. The decoys generated are then clustered, and
the lowest energy structures are selected, followed by an
optimization of the hydrogen-bonding network. The final model is
used to predict the biological function of the protein by matching
the model with other known proteins using the enzyme classification
(EC number), gene ontology vocabulary, and ligand binding
sites. More recently, an I-TASSER-derived method called D-I-TASSER
has been developed for distance-guided protein structure prediction
(https://zhanggroup.org/D-I-TASSER/). This method integrates
inter-residue contacts predicted by a deep neural network and has
been reported to significantly enhance the accuracy of models
compared to I-TASSER.


2.3.4 trRosetta


trRosetta (transform-restrained Rosetta) is an algorithm for protein
structure prediction that uses a deep neural network to predict
inter-residue distances [9]. The algorithm is available as a
stand-alone version and as a web server
(https://yanglab.nankai.edu.cn/trRosetta/). The input is the amino
acid sequence or a multiple sequence alignment of the query protein.
A deep neural network is applied to predict the distributions of
inter-residue distances and orientations. The features used in the
convolutional layers of the network include amino acid frequencies,
entropies, and coevolutionary couplings.

The predicted inter-residue distances and orientations are used as
restraints to guide the Rosetta method in building three-dimensional
structure models based on direct energy minimization.

Recently, the algorithm was modified to include the option to use
templates. It is recommended to run the algorithm with homologous
templates, which are used to add restraints to Rosetta.


2.3.5 AlphaFold2 Method and Structural Database


Given a query sequence, AlphaFold2 [11] searches for related 
sequences in three databases: UniRef90, BFD, and MGnify.
Then, potential templates are searched using HHsearch against 
the PDB70 database [21]. The input sequence, multiple sequence 
alignment, and template hits are used as inputs for the deep 
learning-based method that produces a variety of predictions 
including distances, torsions, and atom coordinates. Then, the 
predicted 3D model is relaxed using restrained gradient descent 
with the Amber ff99SB force field [22] integrated in 
OpenMM [23]. 

AlphaFold2 produces a per-residue confidence metric called the
predicted local distance difference test (pLDDT), on a scale from 0
to 100, which estimates how well the prediction agrees with an
experimental structure considering the Cα atoms. A pLDDT >90 is
considered a highly accurate prediction; in addition to a good
backbone prediction, the side chains are often correctly oriented
(χ1 rotamers are 80% correct). Regions with pLDDT between 70 and
90 indicate a generally good backbone prediction. Regions with
pLDDT between 50 and 70 have low confidence and should be treated
with caution. Finally, regions with pLDDT <50 are probably
disordered.
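The pLDDT bands above can be applied programmatically to a predicted model. The following is a minimal sketch, not part of AlphaFold2 itself: it assumes that the model is a PDB-format file in which the pLDDT is stored in the B-factor column (as is the case for the coordinate files used in the case study of Subheading 5), and the function names and band labels are illustrative. The handling of the exact boundary values (50, 70, and 90) is an assumption, since the text only gives open ranges.

```python
from collections import Counter


def plddt_band(score):
    """Map a pLDDT value (0-100) to the confidence bands described in the text."""
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"pLDDT must be between 0 and 100, got {score}")
    if score > 90:
        return "high accuracy"
    if score >= 70:  # assumption: 70 and 90 themselves fall in this band
        return "good backbone"
    if score >= 50:  # assumption: 50 itself falls in this band
        return "low confidence"
    return "probably disordered"


def ca_plddt_from_pdb(pdb_text):
    """Return {(chain, residue_number): pLDDT} read from the B-factor
    column (columns 61-66) of the CA atoms of a PDB-format text."""
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            residue_number = int(line[22:26])
            scores[(chain, residue_number)] = float(line[60:66])
    return scores


def summarize_bands(scores):
    """Count how many residues fall in each confidence band."""
    return Counter(plddt_band(value) for value in scores.values())
```

A typical use would be `summarize_bands(ca_plddt_from_pdb(open("model.pdb").read()))`, where `model.pdb` is a hypothetical file name; residues in the two lowest bands are candidates for trimming or cautious interpretation before docking.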

In CASP14, AlphaFold2 was the top-ranked protein structure 
prediction method, producing predictions with high accuracy [24]. 

The source code of AlphaFold2 is available on GitHub
(https://github.com/deepmind/alphafold). It is also possible to
use AlphaFold2 via the Google ColabFold notebooks [25], a free
platform for protein folding that does not require any installation
or expensive hardware. Several ColabFold notebooks are available
on GitHub (https://github.com/sokrypton/ColabFold).

DeepMind and EMBL’s European Bioinformatics Institute
(EMBL-EBI) created the AlphaFold database
(https://alphafold.ebi.ac.uk) to provide open access to protein
structure predictions generated by the AlphaFold2 method. At the
moment, the predictions cover almost the entire human proteome [26]
and the proteomes of several other key organisms such as E. coli,
fruit fly, mouse, and zebrafish, totaling over 350,000 protein
structures. The database provides three outputs from AlphaFold2:
the three-dimensional coordinates, the per-residue confidence
metric pLDDT, and the Predicted Aligned Error, which is necessary to
assess confidence in the domain packing and large-scale topology of
the protein.


3 Protein-Protein Interaction Prediction Using Coevolution 


We refer to molecular coevolution when a change in one locus
affects the selection pressure at another locus, and this change is
reciprocal [27, 28]. In other words, when a mutation occurs at a
particular position, another mutation may occur to compensate for
the change or restore the protein function. As coevolving residues
tend to be close in the three-dimensional structure, coevolution has
been successfully applied to predict intra- and inter-protein residue
contacts [29-32]. When coevolution methods were applied at
whole-proteome scale and combined with structure modeling to
predict protein-protein interactions, the accuracy of interaction
prediction was higher than that of proteome-wide two-hybrid and
mass spectrometry screens [33]. A large panel of methods exists to
predict molecular coevolution; all of them use a multiple sequence
alignment (MSA) as input. In general, a large number of diverse
sequences is required to obtain reliable results. To predict
inter-protein coevolution between two proteins A and B, the actual
input of the coevolution algorithm is the concatenated alignment, in
which protein A and protein B from each organism must be properly
paired (Fig. 1). Building the concatenated alignment is not
straightforward, because each row of the MSA should contain a pair
of interacting proteins from the two protein families. It is therefore
desirable to concatenate orthologous proteins, as they are likely to
perform an equivalent function, rather than other types of homologs.
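The pairing requirement can be made concrete with a small sketch. It assumes, for illustration only, that each family is already aligned and stored as a dictionary keyed by an organism identifier, with exactly one sequence per organism (that is, orthologs have already been chosen and paralogs removed). The function name and the data layout are hypothetical and do not correspond to any particular server.

```python
def concatenate_paired_alignments(msa_a, msa_b):
    """Join two aligned protein families row by row, pairing by organism.

    msa_a, msa_b: dict mapping an organism identifier to its aligned
    sequence (gaps included). Only organisms present in both families are
    kept, so every row of the result holds one A/B pair from one organism.
    """
    lengths_a = {len(seq) for seq in msa_a.values()}
    lengths_b = {len(seq) for seq in msa_b.values()}
    if len(lengths_a) > 1 or len(lengths_b) > 1:
        raise ValueError("each input must be an alignment (equal-length rows)")
    shared = sorted(set(msa_a) & set(msa_b))
    return {organism: msa_a[organism] + msa_b[organism] for organism in shared}
```

Organisms found in only one family are dropped, which is why the number of rows in the concatenated alignment can be much smaller than in either individual alignment.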

Fig. 1 Inter-protein coevolution. In the concatenated alignment between
two interacting proteins A and B, two positions coevolve (indicated
with an arrow) to maintain favorable interactions between physically
interacting amino acid residues (indicated as *) in the
three-dimensional structure


The I-COMS web server (http://i-coms.leloir.org.ar) computes
inter-protein contact predictions using four different covariation
methods [34]: corrected mutual information, mfDCA, PSICOV, and
CCMpred. The server gives the option to provide the concatenated
alignment or to build it automatically. Intra- and inter-protein
results are provided in an interactive visualization that allows the
comparison between methods as well as the concordance between
results. Covariation positions can be calculated for up to five
proteins.
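To illustrate what a covariation score measures, the sketch below computes plain mutual information between columns of a concatenated alignment. This is deliberately the simplest possible score: it is not the corrected mutual information, mfDCA, PSICOV, or CCMpred implementation used by the server, and it ignores sequence weighting and gap handling. Function names are illustrative.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Plain mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    if n == 0 or n != len(col_j):
        raise ValueError("columns must be non-empty and of equal length")
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    total = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) written with raw counts
        total += p_ab * math.log2(count * n / (count_i[a] * count_j[b]))
    return total


def inter_protein_pairs(rows, len_a):
    """Score every (column of A, column of B) pair of a concatenated
    alignment and return (score, index_in_A, index_in_B), best first."""
    width = len(rows[0])
    scored = []
    for i in range(len_a):
        col_i = [row[i] for row in rows]
        for j in range(len_a, width):
            col_j = [row[j] for row in rows]
            scored.append((mutual_information(col_i, col_j), i, j - len_a))
    return sorted(scored, reverse=True)
```

With the toy pattern of Fig. 1 (R paired with D, K paired with E), the two columns share 1 bit of information, whereas a conserved column scores 0 against any other column.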


4 Protein Assembly Prediction and Analysis 


4.1 Protein-Protein Docking: Principles and Methods


When structural information on the different protein partners is
available through experimental data or modeling, docking is used as
the standard method to predict their potential interactions. The aim
of docking is to find the best-matching 3D structure of the protein
complex among many candidate models. To do so, a fast search
algorithm samples the possible spatial conformations, and a scoring
function ranks the solutions. Owing to the large number of possible
positions and angles of protein residues, efficient spatial search
algorithms are required; in protein-protein docking they can be
divided into three main categories: (a) exhaustive global search,
including fast Fourier transform (FFT)-based search [35, 36] and
spherical Fourier transform-based search [37-39], (b) randomized
search using Monte Carlo [40, 41], and (c) local shape feature
matching, including geometric hashing [40]. It is important to note
that all FFT-based approaches perform rigid-body docking because the
related grid cannot be updated, unlike randomized search
algorithms.
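The idea behind an exhaustive rigid-body search can be sketched on a toy grid. The example below scores every translation of a "ligand" grid over a "receptor" grid by summing the products of overlapping cells. It is a brute-force loop in two dimensions, not an FFT: FFT-based programs obtain the same correlation for all translations at once, and they also scan rotations, which are omitted here. The grid encoding (positive weight for the receptor surface layer, negative weight for its core) and all names are assumptions made for illustration.

```python
def overlap_score(receptor, ligand, shift):
    """Correlation of two sparse grids for a ligand translated by `shift`.

    receptor, ligand: dict mapping (x, y) grid cells to weights.
    """
    dx, dy = shift
    return sum(weight * receptor.get((x + dx, y + dy), 0)
               for (x, y), weight in ligand.items())


def exhaustive_translation_search(receptor, ligand, max_shift):
    """Score every translation in [-max_shift, max_shift]^2 and return
    (best_score, best_shift)."""
    shifts = [(dx, dy)
              for dx in range(-max_shift, max_shift + 1)
              for dy in range(-max_shift, max_shift + 1)]
    return max((overlap_score(receptor, ligand, s), s) for s in shifts)


# Toy example: the receptor core is penalized (-10), its surface rewarded (+1).
receptor = {(0, 0): -10, (1, 0): 1, (2, 0): 1}
ligand = {(0, 0): 1, (1, 0): 1}
best_score, best_shift = exhaustive_translation_search(receptor, ligand, 2)
```

Because both grids are fixed during the scan, neither partner can change conformation, which is the rigid-body limitation mentioned above.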

Protein-protein docking methods typically generate thousands
of potential solutions for a particular complex. Discriminating
near-native solutions requires a scoring function, whose development
is still challenging. Scoring functions can be divided into several
categories, which are sometimes combined: (a) physics-based scoring
functions capturing the determinants of the stability of
protein-protein complexes, e.g., shape complementarity, van der
Waals, electrostatics, and desolvation potential [41-47],
(b) knowledge-based functions taking advantage of the information
from available structures [48-51], (c) scoring functions combining
physical terms with knowledge-based terms [52-55], (d) evolutionary
scoring functions based on protein sequence evolution [56, 57], and
(e) consensus-based scoring functions seeking to identify solutions
with frequently occurring features, independently of any
physics-based or evolutionary evaluation, such as conservation of
interface contacts [58-62]. In the same spirit as the CASP contest
for protein structure prediction, the CAPRI competition allows a
blind assessment of the most recent methods, offering an updated
view of progress in the field [63-65].


4.2 ZDOCK


ZDOCK is a protein-protein docking method available through an
online web server (https://zdock.umassmed.edu/) [66]. It uses
the fast Fourier transform algorithm to enable an efficient docking
search. The server is user-friendly, and a prediction proceeds in
three steps. The first step is to provide two input structures
(by PDB code or PDB file) and choose the ZDOCK version. The
second step is the selection of blocking or contacting residues for
each submitted protein. The last step is the analysis and
visualization of the results, including the top ten docking models.


4.3 InterEvDock3


InterEvDock3
(https://bioserv.rpbs.univ-paris-diderot.fr/services/InterEvDock3/)
is a server designed for predicting pairwise protein assemblies,
based on sequence or on structure, and possibly combined with
coevolution data [67]. Three protocols are implemented to make the
best use of the available information.

The first protocol is template-based docking, which uses sequences
to search for protein assemblies with already known structures. It
needs two or more sequences and uses HHsearch to search a list of
interacting proteins for homologs whose structure is available in
complex with partners. The structural assembly is built by threading
for the main parts, and the missing parts are built with the
DaReUS-Loop program [68].

The two other protocols perform free docking using the
FRODOCK software. The generated models are then ranked according
to the coevolution information given by the user or computed
by the server.
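One simple way to rank docking poses with coevolution information is to count, for each pose, how many predicted residue pairs are actually in contact. The sketch below illustrates that idea only; it is not the InterEvDock3 scoring function. The 8 Å Cα-Cα contact cutoff, the data layout, and the function names are assumptions.

```python
import math


def satisfied_pairs(ca_a, ca_b, pairs, cutoff=8.0):
    """Count predicted (residue_in_A, residue_in_B) pairs whose Cα atoms
    lie within `cutoff` angstroms in a given pose.

    ca_a, ca_b: dict mapping residue number to (x, y, z) Cα coordinates.
    """
    count = 0
    for res_a, res_b in pairs:
        if res_a in ca_a and res_b in ca_b:
            if math.dist(ca_a[res_a], ca_b[res_b]) <= cutoff:
                count += 1
    return count


def rank_poses(poses, pairs, cutoff=8.0):
    """poses: dict mapping a pose name to (ca_a, ca_b).
    Returns the pose names sorted by satisfied pairs, best first."""
    return sorted(
        poses,
        key=lambda name: satisfied_pairs(*poses[name], pairs, cutoff),
        reverse=True,
    )
```

For example, with a single predicted pair, a pose that places the two Cα atoms 5 Å apart is ranked above a pose that places them 20 Å apart. Weighting each pair by its covariation score, rather than counting pairs, would be a natural refinement.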


5 Case Study: Modeling the Succinate-Quinone Oxidoreductase Heterocomplex 


We propose to build a structural model of the supramolecular
complex succinate-quinone oxidoreductase (SQR). SQR is a key
enzyme in the Krebs cycle, oxidizing succinate to fumarate and
reducing quinone to quinol, thereby linking the Krebs cycle and the
respiratory chain. Escherichia coli SQR has four subunits: two
hydrophilic subunits exposed to the cytoplasm (SdhA and SdhB),
which interact with two hydrophobic membrane-intrinsic subunits
(SdhC and SdhD) [69]. Interestingly, SdhA and SdhB have already
been shown to coevolve. This information made it possible to predict
the correct interaction interface [29-32] when compared with the
crystallographic structure of E. coli SQR [70, 71] (PDB codes: 1NEK,
2WDQ).

For pedagogical purposes, we provide step-by-step instructions
to generate the structural models of the heterotetramer subunits
and their assembly (Fig. 2). First, we shall build a structural model
for all subunits (SdhA, SdhB, SdhC, and SdhD) using either the
AlphaFold2 method without template or I-TASSER without using
close templates. This choice will mimic cases where no
crystallographic information is available. Second, we will use
inter-protein coevolution detection to predict residue contacts
between the subunits. The dataset for coevolution comes from the
available data reported in reference [30] and is provided in
supplementary information (SI1). Third, the predicted residue
contacts will be used to guide the protein-protein docking. Fourth,
the dimers SdhA-SdhB and SdhC-SdhD will be docked without using
coevolution information.


5.1 Building a 3D Model Using AlphaFold2: SQR Subunits SdhA and SdhC


To avoid setting up AlphaFold2 on your local computer, we will use
an online version to build the 3D models of SdhA and SdhC. The
following steps are the same for SdhA (UniProt ID: P0AC41) and
SdhC (UniProt ID: P69054):


1. Download the amino acid sequence of the target in FASTA
format from UniProt.


2. Go to the ColabFold repository
(https://github.com/sokrypton/ColabFold).


3. Choose the Notebook AlphaFold2 (from DeepMind).


4. Execute the first two cells by clicking the play button. This
installs the required programs in the cloud, not on your
computer.


5. Wait until the task is completed; a green tick mark will appear to
the left of the play button. You can also follow the progress of
each task in the progress bar (Fig. 3a).


6. Paste the protein sequence without the FASTA header in the
text box.


7. Select Runtime -> Run After in the toolbar at the top of the
screen.


8. Unzip the results file, which is downloaded automatically.


9. It’s done! Now, we are ready to analyze the results.


To make sure that you can reproduce the result, it is
recommended to save a copy of the notebook on your computer.
Several options to save the notebook are available in the File menu
in the top bar.
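Step 6 asks for the bare sequence, without the FASTA header. A minimal sketch of that conversion is given below; the function name is illustrative, and the example record is a made-up placeholder, not the SdhA or SdhC sequence.

```python
def sequence_from_fasta(fasta_text):
    """Return the sequence of the first FASTA record as a single string,
    without the '>' header line and without line breaks."""
    residues = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if residues:
                break  # a second record starts here; keep only the first
            continue
        residues.append(line)
    return "".join(residues)


# Placeholder record (hypothetical identifier and sequence)
example = ">sp|X00000|DEMO_ECOLI demo protein\nMKLV\nAAGT\n"
print(sequence_from_fasta(example))
```

The returned string can be pasted directly into the notebook text box.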


Fig. 2 Strategy to model the heterocomplex succinate-quinone oxidoreductase (SQR). The complex model was
built as follows. First, we shall build a structural model for all subunits (SdhA, SdhB, SdhC, and SdhD) using
either the AlphaFold2 method without template or I-TASSER without using close templates. Second, we will
use inter-protein coevolution detection to predict residue contacts between the subunits. The dataset for
coevolution comes from the available data reported in reference [32] and is provided in supplementary
information (SI1). The inter-protein contact prediction was carried out using I-COMS. Third, the two subunits
were docked using InterEvDock3 with coevolution information, and in the fourth step a docking was carried
out between the dimers using ZDOCK without coevolution information


To analyze the results, we will visualize two parameters: (a) the
number of sequences and gaps for contact prediction (Fig. 3b) and
(b) the AlphaFold per-residue confidence score (pLDDT), which is
found in the B-factor field of the coordinate files (Fig. 3c). Both
the sequence information and the per-residue pLDDT score gave, on
average, good confidence in the quality of the 3D models (SdhA and
SdhC). To confirm this result, both 3D models were compared with
the corresponding X-ray structures (PDB code: 1NEK, chains A and
C). Using the TM-align server (https://zhanggroup.org/TM-align/),
structural alignments between the models and the solved structures
gave RMSD values of 0.73 Å and 1.33 Å for SdhA and SdhC,
respectively. It is worth noting that these RMSD values correspond
to aligned residues only; they can therefore increase when the whole
structure is considered, including the loop/coil/disordered regions
highlighted in Fig. 4.


Fig. 3 Building a 3D model of SdhC from E. coli using AlphaFold2. Following the ColabFold notebook running
process (a). Coverage of the multiple sequence alignment used by AlphaFold2 (b). Structural model colored by
pLDDT (c). The AlphaFold2 method predicts a bundle of transmembrane helices and a disordered/coil region at
the N-terminus. The latter has low confidence because of the lack of sequence information in this region
(N-terminal region in b)


5.2 Building a 3D Model Using I-TASSER: SQR Subunits SdhB and SdhD


To avoid installing and setting up programs on your computer, we
will use the widely used I-TASSER web server to build the 3D
models of SdhB and SdhD.


1. Register yourself
(https://zhanggroup.org/I-TASSER/registration.html).


2. Download the amino acid sequence of the target in FASTA
format from UniProt (UniProt ID: P07014 and P0AC44 for
SdhB and SdhD, respectively).


3. Go to the I-TASSER web server (https://zhanggroup.org/I-TASSER/).


4. Paste the protein sequence in FASTA format in the text box
(Fig. 5a).


5. Type 60% to exclude homologous templates in the Option II
section.


6. Identify yourself with your email and password.


7. Click on the “Run I-TASSER” box.


8. Wait for the results, which are sent by email.


Fig. 4 Structural comparison between the X-ray structure (1NEK) and the 3D models from AlphaFold2. SdhA (a)
and SdhC (b) structures are shown in cartoon representation and colored as in Fig. 2. X-ray structures are
displayed in transparent gray cartoon representation. Red squares highlight the main regions where
AlphaFold2 differs from the X-ray structure


To analyze the results, we will visualize two parameters: (a) the
threading templates used by I-TASSER and the alignment quality
against the target sequence (Norm Z-score) (Fig. 5b) and (b) the
I-TASSER score (C-score), which gives the confidence of each model
based on the significance of the threading template alignments and
the convergence parameters of the structure assembly simulations
(Fig. 5c). This score ranges between -5 and 2, with higher values
(close to 2) indicating higher confidence in the 3D model and vice
versa. Both the templates and the C-scores (1.23 and 0.53 for SdhB
and SdhD, respectively) gave good confidence in the quality of the
3D models. Indeed, the best C-scores were obtained using chains B
and C of 1YQ3 as templates for SdhB and SdhD, respectively. Even
though the sequences from 1YQ3 share only ~50% and 20% identity
with SdhB and SdhD, respectively, the selected template corresponds
to the same functional complex from another organism (Gallus
gallus). To confirm this result, both 3D models were compared with
the corresponding X-ray structures (PDB code: 1NEK, chains B and
D). Using the TM-align server, structural alignments between the
models and the solved structures gave RMSD values of 2.01 Å and
2.19 Å for SdhB and SdhD, respectively.


Fig. 5 Building a 3D model of SdhD from E. coli using I-TASSER. Submission process described in steps 4 to
7 (a). Top ten threading templates (b). Best 3D model out of the top five final models (c). Values displayed in
panel c: C-score = 0.53; estimated TM-score = 0.78 ± 0.09; estimated RMSD = 3.2 ± 2.2 Å


5.3 Modeling SdhA-SdhB and SdhC-SdhD Using Protein-Protein Docking and Coevolution Information


Among the six possible protein pairs composing the
heterotetramer, we focused on the prediction of SdhA-SdhB and
SdhC-SdhD, the first pair corresponding to the cytosolic subunits
and the second one to the membrane domains. We will use
inter-protein coevolution to predict contacts within each of these
two subunit pairs using the I-COMS server. The input will be the
alignments taken from a previously published and publicly available
dataset, provided in supplementary information (SI1).


1. Download the alignments from SI1. 


2. Go to the I-COMS server (http://i-coms.leloir.org.ar/index. 
php). 


3. Select the option “Upload your own alignments.” 

4. Optionally, you can describe the uploaded dataset. 

5. Upload the two alignments using the “Browse...” button. 

6. Click on “Upload and submit.” 

7. Choose the method for coevolution: plmDCA. 

8. Optionally you can indicate the job description and your email 
address. 


Results include information about the alignment used, such as
the number of sequences and clusters. A low number of clusters
(<400) indicates little diversity in the MSA, and the results should
then be interpreted with caution. Results are shown in a Circos
representation of the covariation scores of each of the selected
methods, and protein pairs are displayed (Fig. 6).


Fig. 6 Docking of subunits using coevolution information. Top five inter-protein coevolution results from
I-COMS server. The inner circle represents the sequence positions in boxes colored according to the sequence
they belong to (SdhA or SdhB). The correlated mutation scores are represented as lines between positions in
the center of the circle. Given as example, the coevolving positions K38 and R52 from SdhA and SdhB,
respectively, are indicated (a). Top five inter-protein coevolving positions are shown in the modeled subunits;
the Cα of coevolving positions are shown in sphere representation (b). Analogous results are given for
subunits SdhC and SdhD, the top five coevolution results (c) and the same coevolving pairs mapped on the
models (d)


To visualize the inter-protein results:

1. Choose the pair of proteins (SdhA vs SdhB) or (SdhC vs SdhD).

2. Select the method.

3. Click on “Draw Circos.”

4. Click on “Inter-protein” links.

5. You can select the number of edges to visualize.

6. Download the covariation raw data; it will be used in the next
steps.


Protein docking of SdhA-SdhB and SdhC-SdhD will be performed
using the InterEvDock3 server and the residue contact predictions
from I-COMS, as described previously. The inputs will be the PDB
files of the two partners to dock and a list of residue pair contacts.

7. Go to the InterEvDock3 server
(https://mobyle.rpbs.univ-paris-diderot.fr/cgi-bin/portal.py#forms::InterEvDock3).

8. Upload Partner A: click on “Browse...” to browse and select the
PDB file.

9. Upload Partner B: click on “Browse...” to browse and select the
PDB file.

10. Click on “Advanced Options.”

11. Go to “Use of co-evolution or deep-learning maps.”

12. Upload the coevolution map (Top 100) from I-COMS given
in SI2.

13. Select “Yes” in “Minimize the output models using
gromacs.”

14. Click on run.


The InterEvDock3 web portal makes it possible to follow the job
progress at any time without any specific link. As a precaution, the
https link associated with the job can be stored locally.

The main InterEvDock3 output provides two kinds of rankings,
limited to the top ten poses: (a) based on the number of structural
contacts matching the predicted coevolution pairs and (b) based on
the scoring function related to the sum of the best predicted
coevolution pairs.

In this study, the best docking poses for both heterodimers are
selected from the second type of ranking, which favors the most
probable pairs according to their coevolution score. The resulting
models of the heterodimers are provided in SI3.


5.4 Modeling the Succinate-Quinone Oxidoreductase Heterocomplex Using Protein-Protein Docking and Restraints


Because merging the concatenated MSAs of SdhA, SdhB, SdhC, and
SdhD does not provide enough information, coevolution cannot be
used to predict residue contacts. Therefore, the docking between
the predicted partners will be done using “classical” docking. Free
docking and docking with restraints will be performed using the
ZDOCK server. To avoid clashes and improve the docking prediction,
the disordered N-terminal regions of SdhC and SdhD are removed,
corresponding to the first 13 and the first 10 residues,
respectively.
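Two preparation steps for this docking run lend themselves to a short script: trimming the disordered N-terminal residues, and choosing the interface residues that guide the restrained docking. The sketch below assumes PDB-format input with standard fixed columns; the function names are illustrative, hydrogens are recognized only from the atom name, and the 3.2 Å heavy-atom cutoff is the value reported for this case study in the text that follows.

```python
import math

RECORDS = ("ATOM", "HETATM")


def trim_n_terminus(pdb_text, chain, n_residues):
    """Remove the coordinate records of the first `n_residues` residue
    numbers of one chain; all other lines are kept unchanged."""
    lines = pdb_text.splitlines()
    numbers = sorted({int(line[22:26]) for line in lines
                      if line.startswith(RECORDS) and line[21] == chain})
    removed = set(numbers[:n_residues])
    kept = [line for line in lines
            if not (line.startswith(RECORDS) and line[21] == chain
                    and int(line[22:26]) in removed)]
    return "\n".join(kept)


def heavy_atoms(pdb_text, chain):
    """Yield (residue_number, (x, y, z)) for non-hydrogen ATOM records."""
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[21] == chain:
            name = line[12:16].strip()
            # Simplification: names such as "H", "HA" or "1HB" are hydrogens
            if name.lstrip("0123456789").startswith("H"):
                continue
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            yield int(line[22:26]), xyz


def interface_residues(pdb_text, chain_a, chain_b, cutoff=3.2):
    """Residue numbers of chain_a with at least one heavy atom within
    `cutoff` angstroms of a heavy atom of chain_b."""
    atoms_b = list(heavy_atoms(pdb_text, chain_b))
    found = set()
    for residue, xyz_a in heavy_atoms(pdb_text, chain_a):
        if any(math.dist(xyz_a, xyz_b) <= cutoff for _, xyz_b in atoms_b):
            found.add(residue)
    return sorted(found)
```

Calling `interface_residues` in both directions (A against B, then B against A) gives the two residue lists to select on the docking server. The all-against-all distance loop is adequate for a complex of this size; a spatial index would be preferable for much larger assemblies.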


1. Go to https://zdock.umassmed.edu/.

2. Choose “PDB file” in the drop-down list next to the “Input
Protein 1” keyword.

3. Click on “Browse...” to select the PDB file corresponding to
SdhA-SdhB.

4. Repeat steps 2 and 3 for Input Protein 2.

5. Fill in the “Enter your email” field.

6. Optionally, for free docking, check the box next to Skip residue
selection.

7. Click on the “Submit” button.

8. If Skip residue selection was not checked, interactively select the
residues belonging to the binding site to guide the docking.

9. Click on the “Submit” button.

10. Wait for the results, which are sent by email.

11. Download the top ten predictions.

12. Select the first docking poses.


This particular case appears to be difficult for docking
prediction. Indeed, free docking does not provide a good solution
compared to the X-ray structure. To get a correct assembly, lists of
17 and 19 residues from SdhB and SdhC-SdhD, respectively (given in
SI2), had to be provided to guide the docking. The binding residues
at the interface can be selected with a distance threshold criterion,
here 3.2 Å between heavy atoms in the X-ray structure. This kind of
information helps to obtain better predictions, as shown in Fig. 7.
Superposition of the modeled heterotetramer onto the X-ray
structure (PDB code: 1NEK) gave an RMSD of ~0.73 Å with the
TM-align server, indicating a very good fit. The coordinate file of
the final model is provided in SI3.


Fig. 7 Superposition of the modeled SQR heterotetramer onto the reference
structure. Each modeled subunit (SdhA, SdhB, SdhC, and SdhD) is shown in
cartoon representation and is colored according to the corresponding label. The
heterotetramer is obtained from the docking of the two main units SdhA-SdhB
and SdhC-SdhD. The reference corresponds to the X-ray structure (PDB code:
1NEK), which is shown in white cartoon representation for clarity


6 Conclusions


Overall, this study shows that protein complex prediction is not a
trivial question. The first crucial task is to obtain the structure of
each protein partner. Depending on the available data, different
approaches can be applied; a recent methodology, AlphaFold2,
outperforms the others. Part of the success in the assembly
construction depends first on the quality of the 3D structural model
of each partner. Therefore, an assessment such as pLDDT is an
important step at this stage. Then, protein-protein interactions can
be predicted with reasonable confidence when diverse information,
such as coevolution prediction or experimental results, is available
to guide toward the most probable assembly. In this study, both
cases are exemplified. Two heterodimers were quite well predicted
using coevolution information thanks to the diversity of the data.
However, construction of the heterotetramer assembly was quite
challenging because the interactions with the membrane are not
taken into account in the docking procedure. To circumvent this
limitation, a set of amino acid residues from the protein interface
identified from experimental data was used to guide the
construction of the heterotetramer assembly.


Modeling Protein Complexes and Molecular Assemblies Using Computational Methods 






Chapter 5 


From Genome Mining to Protein Engineering: A Structural 
Bioinformatics Route 


Derek J. Smith 


Abstract 


This chapter outlines applications in genome mining, along with computational methods to predict protein
structure and protein-ligand docking. It offers a simple computational route to rapidly identify proteins of
interest from genomic and proteomic data, to accurately predict their three-dimensional structures, and to
dock small molecules to their binding pockets, together with strategies to improve their biophysical
properties depending on the needs of the experimental researcher.


Key words Genome mining, Protein structure prediction, Structural bioinformatics, Small molecule 
docking, Protein engineering, Directed evolution 


1 Introduction


The recent rise of rapid genome sequencing and annotation has
produced a wealth of data, almost dazzling in its scope, for
experimental and theoretical researchers. Genomes can be parsed to
identify sequences with potential roles in pharmaceuticals or the
chemical industry. In addition, advances in protein structure
prediction methods, particularly in artificial intelligence and
deep learning, have produced models of almost atomic-level
accuracy. Coupled with this, protein-ligand scoring functions and
rapid docking give the researcher tools to interrogate
increasingly accurate predictive models of the interaction of
proteins with potential drugs or fine chemical substrates.


1.1 Genome Mining: Finding Your Needle in a Haystack


Advances in modern gene sequencing technologies have enabled 
the production of a vast amount of biological sequence data in a 
very short time. The researcher keen to understand the origin of his 
or her metabolite of interest is often now able to decode the 
biosynthetic pathway through genome mining, given a fully 


Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology, 
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_5, 
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023 




sequenced genome of the organism. Genome mining, as a subset of 
data mining, involves the interrogation of genomic sequences, 
based on knowledge of enzymatic reactions and conserved protein 
sequences. It can be used to identify novel sequences that encode 
for specific enzymatic activity towards metabolites. It enables an 
in-depth understanding of metabolite biosynthesis, all the way 
from single enzyme reactions to full biosynthetic pathways and 
their means of regulation [1]. 

Many tools are available for the interrogation of genome, 
transcriptome, and proteome sequences, but here we will focus 
on three standard tools looking at different levels of data. The 
most common sequence search tool is the Basic Local Alignment 
Search Tool (BLAST) software [2]. BLAST allows rapid compari- 
son between biological sequences (nucleotide and protein) and can 
identify both long and short regions of similarity between a query 
sequence and searchable sequence databases. The BLAST software 
is actually a suite of programs for dealing with sequence comparison 
for protein-protein searching (blastp) and nucleotide-nucleotide 
searching (blastn). For more sensitive searching, blastx translates a 
nucleotide query to protein, tblastn translates a nucleotide database 
to protein, and tblastx translates both nucleotide query and database
to protein; all three then perform protein-protein searching. BLAST is
ideal for the identification of single sequences.
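The choice among these five programs follows from the sequence types of the query and the database. As a minimal sketch, the helper below returns the program name for a given pair of types. The function name pick_blast_program and the "direct"/"translated" switch are hypothetical, introduced here only for illustration; they are not part of the BLAST suite.

```shell
#!/bin/sh
# Hypothetical helper (not part of BLAST): map sequence types to a program name.
# Usage: pick_blast_program QUERY_TYPE DB_TYPE [direct|translated]
#   QUERY_TYPE and DB_TYPE are "prot" or "nucl"; the third argument only
#   matters when both are "nucl" and defaults to "direct".
pick_blast_program() {
    case "$1:$2:${3:-direct}" in
        prot:prot:*)          echo "blastp"  ;;  # protein query, protein database
        nucl:nucl:direct)     echo "blastn"  ;;  # nucleotide query, nucleotide database
        nucl:nucl:translated) echo "tblastx" ;;  # both query and database translated
        nucl:prot:*)          echo "blastx"  ;;  # translated query, protein database
        prot:nucl:*)          echo "tblastn" ;;  # protein query, translated database
        *) echo "unknown combination: $*" >&2; return 1 ;;
    esac
}
```

For example, `pick_blast_program nucl prot` prints `blastx`, matching the case of a nucleotide query searched against a protein database.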

A more sophisticated search program is HMMER (pronounced 
“hammer”) [3]. HMMER is also a suite of programs, but instead of 
using individual sequences for search queries, it uses sequence 
profiles. Evolutionarily related sequences are aligned and used to 
encode profiles along the sequence in the form of a hidden Markov 
model (HMM). These profile models represent the likelihood of 
insertions and deletions along the sequence, as well as conserved 
sequence positions and blocks, and are peculiar to individual 
sequence families. This HMM method enables a very sensitive 
scoring and identification of entire sequence families in a genome 
and has been used to identify sequences with novel functions in 
both genomes [4] and transcriptomes [5].
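The scoring idea behind a profile HMM can be written compactly. The following is a simplified sketch: the symbols are introduced here for illustration, and the first expression ignores the insertion, deletion, and transition terms that a full profile HMM also scores. The score of residue a at match position k compares the position-specific emission probability e_k(a), estimated from the family alignment, with the background frequency f(a). The bit score of a whole sequence x compares its probability under the profile model M with its probability under a null model R.

```latex
s_k(a) = \log_2 \frac{e_k(a)}{f(a)},
\qquad
S(x) = \log_2 \frac{P(x \mid M)}{P(x \mid R)}
```

A strongly conserved position gives a large positive s_k(a) for the conserved residue and a negative score for most others, which is why a profile can separate true family members from chance matches more sensitively than a single query sequence.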

As well as locating single sequences and families, one further 
level of searching that is relevant to this discussion is the identifica- 
tion of biosynthetic gene clusters (BGCs). For the production of 
natural products and other secondary metabolites, organisms tend 
to organize their biosynthesis genes into clusters, with the most 
well-known being the observance of operons (long lengths of DNA 
encoding entire biosynthetic pathways) in bacteria [6]. This is also 
seen in eukaryotes such as fungi [7], and even in plants, where 
genes for specific biosynthetic pathways are seen to be co-localized 
(e.g., terpene synthases and P450s, phenylpropanoids, alkaloids, 
and plant defense compounds [8]). One commonly used program 
for identifying BGCs is the antibiotics and Secondary Metabolite
Analysis SHell (antiSMASH) [9]. antiSMASH is available as both a
web server and a downloadable stand-alone program. It can be
used to identify, annotate, and compare potential BGCs from
bacteria, and versions are also available for fungal and plant
genomes. An uploaded genome first undergoes gene prediction,
followed by gene cluster identification. antiSMASH can also predict
the chemical structures of potential products and perform protein
domain analyses and comparisons with related clusters from other
organisms.


1.2 Protein Structure Prediction: Then and Now


Having obtained the sequence(s) of interest, cloning and expres- 
sion to identify properties and activity are next. One aspect that is 
often neglected in a molecular biology environment is to try to 
obtain the structure — or at least a reasonable model — of the 
expressed protein. This is key for understanding activity: the
function of the protein is itself a function of the three-dimensional
structure of that protein. For example, an enzyme works by folding 
in such a way as to bring catalytic amino acids together in space, or 
by binding a reactive cofactor molecule, together with a chemical 
environment (polar, charged, hydrophobic, or a combination of 
these) that permits the binding of a chemical substrate, a reaction, 
and then egress of the products. 

The problem arising is that it is not easy to predict the
structures of proteins de novo. Levinthal’s paradox notes that a
protein sampling every possible conformation at random would need
an astronomical amount of time to find its native fold. Because
folding actually occurs much faster than this, the protein must fold
not by exhaustive sequential search but through intermediate
states [10].
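A back-of-the-envelope version of Levinthal's argument makes the scale concrete. The numbers are illustrative assumptions rather than values from this chapter: a chain of N = 100 residues, 3 accessible conformations per residue, and 10^13 conformations sampled per second.

```latex
\begin{align*}
\text{conformations} &= 3^{N} = 3^{100} \approx 5 \times 10^{47} \\
t_{\text{search}} &\approx \frac{5 \times 10^{47}}{10^{13}\ \text{s}^{-1}}
  = 5 \times 10^{34}\ \text{s} \approx 1.6 \times 10^{27}\ \text{years}
\end{align*}
```

By comparison, the universe is roughly 1.4 × 10^10 years old, whereas real proteins typically fold on timescales of microseconds to seconds.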

The most common ways of experimentally determining protein 
structures are x-ray crystallography and nuclear magnetic resonance 
(NMR). These standard tools are excellent ways of obtaining
structures but often involve a lot of experimental effort. For x-ray
crystallography, this includes growing regular crystals that diffract,
as well as solving the phase problem through molecular replacement
or heavy atom isomorphous replacement techniques. In the case of
NMR, expensive isotope-enriched material is often required, which
must be incorporated into the expressed protein.

Protein structure prediction through computational methods 
is a faster route to get to a model protein structure. Historically, 
protein structure prediction has been dominated by what is often 
referred to as homology or comparative modeling. Chothia and Lesk 
[11] identified that evolutionarily related sequences are likely to
have similar structures, and the closer the sequences, the closer 
the structures, especially in the conserved core regions. This 
became an early basis for comparative modeling of protein 
sequences, which can be outlined as follows: 


A. Starting with a target sequence of interest, the Protein Data 
Bank (PDB) of deposited 3D structures [12] is searched using 
BLAST for template structures related to the target sequence. 
If no related structure is found, a threading algorithm may be 
used to try to identify more distantly related structures [13]. 


B. These structures are aligned to the target sequence to produce
a sequence-structure alignment. This allows the identification
of structurally conserved regions (SCRs) of main chain backbone
and structurally variable regions (SVRs), which correspond
roughly to loop regions.


C. The model is constructed by copying the SCRs, along with
conserved sidechains, to form the basis of the model. This is
known as fragment matching [14]. Missing loops are either
added from loop libraries or constructed de novo, and then
sidechains are added, usually from rotamer libraries of
observed sidechain conformations. Another method of
construction, known as segment matching [15], relies on copying
short lengths of conserved sequence-structure regions across
the whole model. A third method, the satisfaction of spatial
restraints, identifies geometric features of the template
structures and converts them to probability density functions.
These are then optimized across the whole model to generate
a 3D representation, similar to the protocol used in NMR
protein structure determination [16].


D. The model is then refined, usually by energy minimization or
molecular dynamics, and assessed by stereochemical fidelity to
known protein structures, as well as by knowledge-based
potentials of protein folding [17]. If the model is found
to have errors, the cycle can be repeated with different
template structures or an altered sequence-structure
alignment.


1.2.1 Programs and Servers for Comparative Modeling


Many comparative modeling programs are available to the
researcher. One of the most popular stand-alone programs is
MODELLER [16]. This was the original method of protein
modeling by satisfaction of spatial restraints and is still a rapid
and reasonably accurate method of obtaining structural models. It
does not have a GUI but can be used as part of the PyMod plugin
[18] for the PyMOL visualization software [19]. This provides a
complete sequence searching and structure modeling package for
academic use.

Automatic structure prediction servers can also produce
high-quality models, such as HHpred [20], I-TASSER [21], Robetta
[22], and RaptorX [23], which are also evaluated on a weekly basis
using the Continuous Automated Model EvaluatiOn server
(CAMEO) [24]. Most of these server-based programs can also be
downloaded for individual or group usage.


1.2.2 The Rise of AlphaFold


The recent success of the AlphaFold2 program at the 14th Critical 
Assessment of Techniques for Protein Structure Prediction 
(CASP14) made headlines worldwide [25]. AlphaFold directly 
predicts protein structure using only the target sequence and a 
multiple sequence alignment as inputs. The inputs first go through
a neural network block known as Evoformer, which processes the
alignment and target sequence into arrays. These then enter the
structure module, where an explicit 3D representation of each
residue is optimized through rotation and translation to obtain a
highly accurate model. The assessors of CASP14 considered the
accuracy of AlphaFold2 for nearly two-thirds of its predictions to be
competitive with that of experimental methods (~1 Å deviation for
the protein backbone). Other related programs such as trRosetta
[26] and RoseTTAFold [27] also offer highly accurate structure
models.


1.3 Computational Docking of Small Molecules


Having obtained the protein structure model, the researcher may
now require the use of docking software to dock small molecules to
the protein. Many proteins interact with small molecules and are
involved in processes including catalysis and signal transduction. If
the researcher is modeling an enzyme of interest, small molecule
substrates, products, and inhibitors may be of value to include
within the model for research. The goal is to predict the preferred
orientation of the small molecule (or “ligand”) relative to the
protein (or “receptor”) in the formation of a stable protein-ligand
complex. This orientation can then be used to predict the binding
affinity of the ligand for the receptor protein through the use of a
scoring function. These techniques are most commonly used in
structure-based drug design and in the engineering of proteins
towards desired substrates or products.

The basic protocol involves (a) a search algorithm coupled with
(b) a scoring function. The docking problem can be approached as
matching the shape complementarity [28] or the pairwise
interaction energies [29] of the ligand and receptor. The search
algorithm systematically searches for the optimal ligand binding
pose. This can be done by exploring all rotatable bonds of the
ligand [30], by molecular dynamics simulations [31], or by genetic
algorithms [32]. The various poses are evaluated by a scoring
function, usually a physics-based molecular mechanics function
that calculates the energy of binding for the ligand [33].


1.3.1 Available Docking Software


Owing to the importance of docking in structure-based drug design,
many commercial docking programs are available, including Glide
[34], GOLD [32], and MOE [35]. Some small molecule docking servers
are also available, such as SwissDock [36], PatchDock [37], and
EADock [38]. Software that is available for download and free for
academic use includes DOCK [39], AutoDock [40], and AutoDock Vina
[41, 42].


1.4 Protein Engineering for Alteration of Functional Properties


Most proteins operate at physiological conditions, and for most
research purposes, this is not an issue. Given a structure, or an
accurate protein structure model, it is possible to identify catalytic
residues and binding pockets and also obtain docked
substrates/inhibitors as already discussed. However, when it comes to
the use of proteins as biocatalysts for industrial purposes (a field
becoming more and more active due to the potential for
environmentally friendly biosynthesis of fine or bulk chemicals), this
becomes a problem. Industrial reaction conditions, including higher
temperatures, as well as the substrate and product concentrations
required for large-scale chemical biosynthesis, often lead to loss of
stability, causing poor enzyme efficiency and loss of activity.

Certain strategies can be effectively engaged to improve protein
physical and reactive properties to produce active biocatalysts
beyond physiological conditions. Directed protein evolution,
combined with rational and semi-rational approaches to mutational
library design, is a very effective means of improving protein activity
under industrial conditions. This process has four basic steps:


A. Start with an appropriate sequence, referred to as the
“backbone.” This is a sequence that possesses some activity (however
small) on the substrate at physiological conditions, or is
predicted to be active given certain mutations within the active
site (if the substrate is non-native).


B. Library construction on the backbone. Directed evolution
involves the use of “libraries” of mutations. The most basic
form of library construction involves error-prone PCR
[43]. Other, more targeted libraries for specific amino acid
residues involve site-saturation mutagenesis [44] for single
positions and potentially the whole enzyme, as well as
predicted single mutations. Combinatorial libraries can be used
for optimizing multiple positions together.

C. The mutants are then screened at the required conditions, or
conditions close to those needed. Most mutations will be
deleterious or neutral, but a number will be beneficial [45].


D. The beneficial mutations are recombined onto the most active
hit from the first round of evolution, often through a
combinatorial library. The cycle then begins again until the desired
conditions are met by the evolved biocatalyst.


This cycle can be used to improve any functional property of
the protein of interest. For most basic research, a few rounds of
evolution are all that is necessary for proof of concept. This is
illustrated well in a study on the production of simvastatin by the
LovD enzyme [46], with a parallel, longer optimization of the
same enzyme for industrial production [47].


2 Methods


The overall pipeline is illustrated in Fig. 1. All methods described in 
this work can be run on a standard Linux OS-based laptop, desk- 
top, or workstation. Many of them are also compatible with Mac 
OS X and Microsoft Windows, allowing for ease of use for any
experimental researcher. For visualization of protein structures and
models, PyMOL (Schrödinger) is recommended [19]. Newer versions
are available to download with a subscription to an academic license,
but older, freely available unsupported versions may also be found
online.


Fig. 1 A pipeline for interrogation of protein sequences. The sequence of interest is obtained by genome
mining (a), followed by three-dimensional protein structure prediction using AlphaFold (b). This model can
then be used to dock substrates/ligands of interest with AutoDock Vina (c), laying the groundwork for directed
evolution to improve protein activity, stability, or other functional parameters (d)


2.1 Searching Genomes and Proteomes with BLAST and HMMER


BLAST software is obtained from the NCBI website
(https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download),
and HMMER software can be found at the HMMER homepage
(http://hmmer.org). BLAST databases are also available for
download, but here we will describe how to create a searchable
sequence database for BLAST using a genome/proteome FASTA file.
The BLAST software includes a program called “makeblastdb,” which
performs the necessary conversion. Having installed the BLAST
software, and given a proteome sequence file “proteome.fa,” the
following Unix command may be used to create the database (see
Note 1):


$ makeblastdb -in proteome.fa -parse_seqids -blastdb_version 5 \
    -title "Proteome Database" -dbtype prot


If a nucleotide sequence file is used, then the dbtype should be 
set to “nucl.” To search the newly created database with protein- 
protein blast (blastp) using a query sequence “query.fa,” the fol- 
lowing command is used: 


$ blastp -db proteome.fa -query query.fa -out query_results.out 


This can be tailored for any of the BLAST programs. 
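The default pairwise report is awkward to post-process. One common alternative is the tabular format requested with the `-outfmt 6` option, whose twelve default columns include the subject ID (column 2), percent identity (column 3), E-value (column 11), and bit score (column 12). The sketch below filters such a table with awk. The function name filter_hits, the file name query_results.tsv, and the default thresholds (E-value at most 1e-5, identity at least 30%) are assumptions chosen for illustration, not recommendations from this chapter.

```shell
#!/bin/sh
# Hypothetical helper: keep hits from BLAST tabular output (-outfmt 6) that pass
# an E-value ceiling and a percent-identity floor, then print
# subject ID, percent identity, E-value, and bit score (tab-separated).
# Usage: filter_hits RESULTS.tsv [MAX_EVALUE] [MIN_IDENTITY]
filter_hits() {
    awk -F '\t' -v max_e="${2:-1e-5}" -v min_id="${3:-30}" '
        # "+ 0" forces numeric comparison for values such as 1e-50
        ($11 + 0) <= (max_e + 0) && ($3 + 0) >= (min_id + 0) {
            print $2 "\t" $3 "\t" $11 "\t" $12
        }' "$1"
}

# With BLAST installed, the input table could be produced by (not run here):
#   blastp -db proteome.fa -query query.fa -outfmt 6 -out query_results.tsv
```

Calling `filter_hits query_results.tsv 1e-10 40` would tighten both thresholds; suitable cutoffs depend on the sequence family and database size.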

HMMER software only requires a multi-sequence FASTA file, with no
conversion to a database. Profile HMMs are found in the
“Pfam-A.hmm” file, and individual HMM profiles of interest can be
retrieved with the “hmmfetch” command (see Note 2). The profiles
can be used to search the proteome sequence file as follows:


$ hmmsearch profile-hmm proteome.fa > profile_results.out


2.1.1 Searching for Biosynthetic Gene Clusters with antiSMASH


antiSMASH is available for download as well as offered as a server
(https://antismash.secondarymetabolites.org/). Links to the
fungal and plant versions are on the website. The inputs are a genome
file in FASTA or GenBank format and an annotation file in GFF3
format (see Note 3). The web-based output allows for full
interrogation of potential BGCs, as well as downloading the identified
sequences for further study.


2.2 Modeling Protein Structures with AlphaFold Through Google Colab


AlphaFold2 software is available for download but requires the use 
of a GPU cluster. However, an alternative exists for the researcher 
with restricted computational power: Google Colaboratory offers 
“ColabFold” as a service to the scientific community [48]. The 
researcher can access the Colab notebook for AlphaFold and gen- 
erate accurate protein models within an hour or two, in an interac- 
tive setting, using Google’s GPU clusters (see Note 4). The basic 
protocol is as follows: 


A. Access the Colab notebook
(https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb).


B. Paste the target sequence into the notebook and add a name
for your project. AlphaFold is also optimized for multimers
and complexes. If you wish to model a homodimer, paste
SEQUENCE1:SEQUENCE1, using a colon as a break. For an
α2β2 tetramer, you would paste
SEQUENCE1:SEQUENCE1:SEQUENCE2:SEQUENCE2.


C. Check “use_amber” and “use_templates.” The use of templates
does not affect the overall model structure, but as they
can be used as extra restraints in the prediction, it is worth
adding them. Sidechains must be optimized through
AMBER [49]. Although this relaxation can double the runtime for
structure prediction, accurate sidechain prediction is essential for
any further use of the model for docking studies or protein
engineering.

D. Go to the Runtime tab and click “run all.” The server runs 
interactively, so the browser window must remain open at all 
times (a subscription is required to keep data if the window is 
accidentally closed, or the computer goes into sleep mode). 


E. After completion, a .zip file is created containing all of the
results and automatically downloaded to the computer. ColabFold
outputs five amber-relaxed and unrelaxed models, the input
multiple sequence alignment generated from sequence databases,
as well as several plots including sequence coverage and
predicted local distance difference test (pLDDT) scores for
all residues.

Figure 2 shows some typical results for a ColabFold run. The
sequence used here is the predicted protein sequence of
Cav01g29270, a chalcone synthase gene obtained from the
recently sequenced genome of the European hazelnut (Corylus
avellana L.) [50]. The chalcone synthase family is a large sequence
family, and a good quality model was obtained due to the high
sequence coverage of the multiple sequence alignment obtained
against most of the length of the predicted sequence.

[Figure 2 panel titles: “Sequence coverage”, “Predicted lDDT per position”, “rank_1”]

Fig. 2 Results from a ColabFold structure prediction for the protein product of the chalcone synthase gene
Cav01g29270 from hazelnut. Outputs include a plot of sequence coverage of the multiple sequence alignment
(MSA) to the target sequence (a), a plot of predicted local distance difference test (pLDDT) scores (b) for the
five predicted models (colored by chain), and a plot of predicted aligned error (PAE) between the chains for the
highest ranked dimeric model (c). The pLDDT score is very high across the model, and the PAE score between
the chains is very low, indicating a confident prediction of the protein structure. This is a function of the very
high sequence coverage of the MSA across this sequence. The worst scoring region is the first 20 residues at
the N-terminus of the sequence, which have little to no sequence coverage in the MSA. (d) shows the highest
ranking structure obtained colored by pLDDT (spectrum of red (50 and below) to blue (90 and above)), showing
both the overall high score for this model and the low-scoring N-terminal residues

2.3 Small Molecule Docking with AutoDock Vina

AutoDock Vina is a new generation of AutoDock. First released in
2010 by the Olson group at the Scripps Research Institute [41],
Vina enabled fast, accurate small molecule docking with limited
computational power. It also included the treatment of the receptor
as flexible (identified flexible sidechains in the binding pocket could
be included in the search space). It has recently undergone a
revision (version 1.2) [42] which allows it to use the AutoDock
scoring function (AutoDock4.2). It also has expanded capacities for
specific water molecule inclusion and simultaneous docking of
multiple ligands (ideal for where an enzyme binds a cofactor as
well as a substrate, or for looking at binding of enzymatic cleavage
products). A short but informative video tutorial is available online
(see Note 5), but we shall consider the basics here:

A. The associated program AutoDockTools can be downloaded
and used to do three things. Firstly, the protein is prepared for
docking by adding polar hydrogens (all non-polar hydrogens
are removed and are implicitly treated as part of the non-polar
heavy atom they are bonded to). Protonation states for active
site histidines should be checked. Also, the relative rotameric
conformation of histidine, asparagine, and glutamine
(“HNQ”) should be checked to ensure optimal hydrogen-bonding
networks using an online server such as WHAT IF [51].
Secondly, a grid that encompasses the entire binding
pocket is calculated, and dimensions/central origin can be
noted down in a text file to define the search space. Thirdly,
the ligand is parameterized and all rotatable bonds are identified
for the search. The protein and ligand are stored as PDBQT files,
which are based on the PDB format, but include atomic
charges (and rotation information for the ligand).

B. A configuration file (“docking.conf”) can be created where the
protein and ligand PDBQT files are specified, as well as the
name of the file to be saved containing the docked poses and
the calculated grid dimensions.

C. The program can then be run using the following command:

$ vina --config docking.conf --log docking.log

where vina stands for the full path to your installation of
the program. The docked models can then be assessed either in
AutoDockTools or in PyMOL. Lowest energy docking poses
can be checked to see whether they are in appropriate positions
to affect protein activity (i.e., is this substrate in an appropriate
position for enzyme catalysis, etc.).

2.4 Protein Engineering Strategies for Structural Models

Having a three-dimensional protein model (with a docked
substrate/inhibitor/ligand) is a good start for a protein engineering
project. Here we will discuss a few simple strategies for improvement
of physical and functional properties of proteins.

The first thing is to make a list of the amino acid residues in 
different parts of the structure. Using the docked substrate, it is 
straightforward in PyMOL or other protein visualization tools to 
identify important clusters of residues. It is helpful to divide the 
whole protein structure into four (or five) basic bins: 


• The active site/binding pocket. Calculate all amino acid residues
<4 Å away from the bound substrate. This should include the
catalytic residues of your enzyme and the constellation of
residues that make up the substrate-binding pocket. These positions
are most likely to affect the activity of the enzyme and the
specificity and stereoselectivity of the reaction products.

• The cofactor-binding residues. This is an optional bin for those
enzymes which require a small-molecule cofactor for activity.
Again, calculate all positions within 4 Å of the cofactor molecule.
These positions can often affect the overall stability of the
protein, as the binding of the cofactor by the protein adds to the
stability through a chelation effect (the stronger the binding to
the cofactor, the more stable the protein).

• The secondary sphere residues. These are all positions between
4 and 8 Å of the substrate/ligand and may have an effect on
specificity and activity due to their relative closeness to the active
site and binding pocket.

• Core residues. These are the rest of the residues found in the
interior of the protein, mostly hydrophobic, and contribute
more to overall stability.

• Surface residues. All positions found on the surface of the
enzyme. They may be fully exposed to solvent or partially buried
in the core.

For multimeric proteins/enzymes, another bin that may be of
use is multimeric interfaces, which also contribute to overall stability
(identifying all residues on that surface will be sufficient). This
binning is not essential but is useful both in targeting for functional
improvements and in interpreting observed improvements from a
structural perspective, as one can easily identify the locations of
positions that are potentially evolvable for functional purposes.
We will now discuss strategies for improving three functional
parameters: overall stability, activity/specificity, and thermostability.

2.4.1 General Stability

A relatively straightforward way of library design for improving
protein stability is to identify “most common” mutations relative
to your sequence. The basic idea is that through the evolution of a
particular family of proteins, many amino acids (both specific
residues and residue types) are conserved to maintain the stability
of the protein, regardless of its specific activity. A basic strategy
here would be to obtain a list of related sequences, from 40% to 25%
sequence identity (the identity zone at which the structures are
likely to be conserved across all sequences). This can be done using
BLAST against a non-redundant protein sequence database. The
sequences can then be aligned to produce a multiple sequence
alignment using, for example, Clustal Omega [52]. This alignment can
be used to identify all amino acids at all given positions, and the
percentage of each amino acid at each position can be calculated. If
your sequence has, for example, isoleucine at position 14, and the
greatest percentage at position 14 in your multiple sequence
alignment is for valine, then you can take I14V as a potential
stability-enhancing mutation. This can be done across the sequence
for as many positions as the researcher desires, and either
individual mutations or, more effectively, a combinatorial library of
mutations can be generated for your sequence and experimentally
tested. The more stable the enzyme, the greater the potential for
further evolvability [53].

2.4.2 Activity/Specificity

Here, the most likely place for alteration of enzyme activity or
specificity is the active site and the binding pocket. Given a docked
substrate-enzyme complex model, the residue positions around the
substrate can be identified as above, and particular mutations can be
specified to enlarge/reduce the pocket size and add complementary
charged/polar or hydrophobic residues. This can be performed
with both native and non-native substrates. For a more thorough
analysis, site-saturation mutagenesis is an excellent way of
identifying specific beneficial mutations, which can then be recombined
in following rounds of evolution. However, improvements in activity
can occur through mutation far from the active site [54], and
error-prone PCR is also a useful strategy to obtain random beneficial
mutations around the protein, which can be combined with
activity-improving active site mutations.

2.4.3 Thermostability

Thermostability is important for enzymes which may be required to
perform reactions at higher temperatures. Many studies have
identified specific alterations to improve thermostability in proteins,
and these can be used, along with structural details, to identify
potential thermostable mutations [55]. The enzyme model and calculated
residue bins mentioned above are very useful at this point. For core
residues, increased branching of aliphatic sidechains (A > V;
V > L/I) is often correlated with thermostability. For the surface,
removal of flexible glycine and reactive sulfur-containing
cysteine/methionine and increase in surface prolines to increase
rigidity may also contribute. For polar surface residues, increasing
the numbers of salt bridge pairs (D/E vs K/R) and removal of reactive
aspartyl-prolyl motifs (DP) are recommended. The DP motifs are often
found at the beginning of turns and helices, and mutation of
aspartic acid to asparagine, serine, or threonine is suggested here.
The model can be used to identify all these sites of potential
improvement.
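Of the rules listed above, the aspartyl-prolyl (DP) rule is the only one that can be applied from sequence alone. The following is a minimal sketch of how such sites might be listed; the function name suggest_dp_mutations and the example sequence are hypothetical, and the other rules (core branching, surface glycines, salt bridges) would additionally need the structural bins derived from the model.

```python
def suggest_dp_mutations(sequence, replacements=("N", "S", "T")):
    """List candidate D->N/S/T substitutions at every Asp-Pro (DP) motif.

    Positions are 1-based, and mutations are written in the usual
    wild-type/position/new-residue style (e.g., D3N).
    """
    sequence = sequence.upper()
    suggestions = []
    for index in range(len(sequence) - 1):
        if sequence[index] == "D" and sequence[index + 1] == "P":
            position = index + 1  # convert 0-based index to residue number
            suggestions.extend(f"D{position}{new}" for new in replacements)
    return suggestions


if __name__ == "__main__":
    # Placeholder sequence containing two DP motifs.
    for mutation in suggest_dp_mutations("MADPKDPL"):
        print(mutation)
```

The resulting list is only a starting point: each candidate should still be inspected in the model to confirm that the aspartate is not a catalytic or cofactor-binding residue.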

For thermostable mutations to be identified, both an analysis of 
the structure and the multiple sequence alignment can be used to 
find positions with naturally occurring diversity that can be used to 
produce combinatorial libraries for improving thermostability. One 
useful tool is the Sorting Intolerant from Tolerant (SIFT) server 
[56]. This allows the user to submit his or her sequence of interest, 
performs a sequence similarity search, and calculates which amino 
acids are tolerated/not tolerated at all positions. This data can be 
used to identify potential sequence diversity for library 
construction. 


3 Conclusion

Having set out a pipeline for interrogation and experimentation of
a biological sequence from genome through to full protein
engineering, it is hoped that the experimental researcher will be
empowered to sit down at a workstation and augment his or her
experimental data with atomic-level detail for increased
understanding of their protein(s) of interest. Many of these procedures
have become faster and more accurate over time and possess great
explanatory power, granting detailed insights into protein
structure, function, and engineering possibilities.

4 Notes

1. Although the bare BLAST command names are used here, the full
path to the BLAST programs should be given, e.g., “makeblastdb”
may be located in /usr/local/bin/blast/, which should be added
at the start of the command.


2. This is the simplest way of running HMMER for protein 
sequences. It can also be used for nucleotide sequences. To 
obtain the profile family names, run a query sequence through 
the PFAM website sequence search (http://pfam.xfam.org) 
and observe the protein family profiles identified in the search. 


3. The annotation file to be used with the antiSMASH software
should be checked to ensure it matches the genomic FASTA file;
otherwise the program quits within 15 min of running.


4. The Google Colab version of AlphaFold can handle a maximum
of 1400 residues (monomer or multimers). A larger
protein/protein complex would require local installation of
the software. The accuracy of the model is dependent on
both the number of related sequences in the multiple sequence
alignment and the coverage of those sequences across the
length of the target sequence. AlphaFold is very good for
enzymes that match the traditional “lock-and-key” model of
protein-ligand interactions. However, where the model is
“induced fit,” or allostery is important, or the enzyme requires
a large conformational change in order for activity, some
qualifications are in order. This author has tried to model the
epidermal growth factor receptor (EGFR) kinase domain and
obtained accurate models of the “closed” conformation only.
Likewise, AlphaFold does not model cofactors or metal ions,
but as it is trained on the PDB structure database, some models
may be “pre-organized” for inclusion of these cofactors, and
docking may be relatively easy. Also, AlphaFold cannot
accurately model the effect of single mutations on a structure (e.g.,
some mutations are known to cause large conformational
changes in kinases): because the alignment only samples native
sequences from databases and derives correlated mutations
from these, all mutations added to the target sequence will be
modeled in the same conformation as the “native” structure.


5. Some differences are to be expected in the AutoDock Vina 
online tutorial as it dates back to 2010 (e.g., the option “all” 
is now called “out” in the newer Vina version). Refer to the 
updated manual for a more detailed tutorial. 


References 


1. Scherlach K, Hertweck C (2021) Mining and unearthing hidden biosynthetic potential. Nat Commun 12:3864. https://doi.org/10.1038/s41467-021-24133-5
2. Ye J, McGinnis S, Madden TL (2006) BLAST: improvements for better sequence analysis. Nucleic Acids Res 34:W20-W25. https://doi.org/10.1093/nar/gkl164
3. Eddy SR (2009) A new generation of homology search tools based on probabilistic inference. Genome Inform 23:205-211. PMID: 20180275
4. Kumar Y, Khan F, Rastogi S et al (2018) Genome-wide detection of terpene synthase genes in holy basil (Ocimum sanctum L.). PLoS One. https://doi.org/10.1371/journal.pone.0207097
5. Han XJ, Wang YD, Chen YC et al (2013) Transcriptome sequencing and expression analysis of terpenoid biosynthesis genes in Litsea cubeba. PLoS One 8(10):e76890. https://doi.org/10.1371/journal.pone.0076890
6. Lawrence JG (2002) Shared strategies in gene organization among prokaryotes and eukaryotes. Cell 110:407-413. https://doi.org/10.1016/S0092-8674(02)00900-5
7. Robey MT, Caesar LK, Drott MT et al (2021) An interpreted atlas of biosynthetic gene clusters from 1,000 fungal genomes. Proc Natl Acad Sci U S A 118(19):e2020230118. https://doi.org/10.1073/pnas.2020230118
8. Polturak G, Osbourn A (2021) The emerging role of biosynthetic gene clusters in plant defense and plant interactions. PLoS Pathog 17(7):e1009698. https://doi.org/10.1371/journal.ppat.1009698
9. Blin K, Shaw S, Kloosterman AM et al (2021) antiSMASH 6.0: improving cluster detection and comparison capabilities. Nucleic Acids Res 49:W29-W35. https://doi.org/10.1093/nar/gkab335
10. Rooman M, Dehouck Y, Kwasigroch JM et al (2002) What is paradoxical about Levinthal paradox? J Biomol Struct Dyn 20:327-329. https://doi.org/10.1080/07391102.2002.10506850
11. Chothia C, Lesk AM (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5(4):823-826. https://doi.org/10.1002/j.1460-2075.1986.tb04288.x
12. Burley SK (2021) RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences. Nucleic Acids Res 49:D437-D451. https://doi.org/10.1093/nar/gkaa1038
13. Peng J, Xu J (2010) Low-homology protein threading. Bioinformatics 26:i294-i300. https://doi.org/10.1093/bioinformatics/btq192
14. Bujnicki JM (2006) Protein-structure prediction by recombination of fragments. Chembiochem 7(1):19-27. https://doi.org/10.1002/cbic.200500235
15. Levitt M (1992) Accurate modeling of protein conformation by automatic segment matching. J Mol Biol 226(2):507-533. https://doi.org/10.1016/0022-2836(92)90964-1
16. Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234(3):779-815. https://doi.org/10.1006/jmbi.1993.1626
17. Sippl MJ (1995) Knowledge-based potentials for proteins. Curr Opin Struct Biol 5(2):229-235. https://doi.org/10.1016/0959-440x(95)80081-6
18. Janson G, Paiardini A (2021) PyMod 3: a complete suite for structural bioinformatics in PyMOL. Bioinformatics 37:1471-1472. https://doi.org/10.1093/bioinformatics/btaa849
19. The PyMOL Molecular Graphics System, Version 2.4.1 Schrödinger, LLC
20. Hildebrand A, Remmert M, Biegert A et al (2009) Fast and accurate automatic structure prediction with HHpred. Proteins 77 Suppl 9:128-132. https://doi.org/10.1002/prot.22499
21. Yang J, Yan R, Roy A et al (2015) The I-TASSER Suite: protein structure and function prediction. Nat Methods 12(1):7-8. https://doi.org/10.1038/nmeth.3213
22. Kim DE, Chivian D, Baker D (2004) Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res 32:W526-W531. https://doi.org/10.1093/nar/gkh468
23. Kallberg M, Margaryan G, Wang S et al (2014) RaptorX server: a resource for template-based protein structure modeling. Methods Mol Biol 1137:17-27. https://doi.org/10.1007/978-1-4939-0366-5_2
24. Haas J, Barbato A, Behringer D et al (2018) Continuous Automated Model EvaluatiOn (CAMEO) complementing the critical assessment of structure prediction in CASP12. Proteins 86(Suppl 1):387-398. https://doi.org/10.1002/prot.25431
25. Jumper J, Evans R, Pritzel A et al (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596:583-589. https://doi.org/10.1038/s41586-021-03819-2
26. Yang J, Anishchenko I, Park H et al (2020) Improved protein structure prediction using predicted interresidue orientations. Proc Natl Acad Sci U S A 117(3):1496-1503. https://doi.org/10.1073/pnas.1914677117
27. Baek M, DiMaio F, Anishchenko I et al (2021) Accurate prediction of protein structures and interactions using a three-track neural network. Science 373:871-876. https://doi.org/10.1126/science.abj8754
28. Shoichet K, Kuntz ID, Bodian DL (1992) Molecular docking using shape descriptors. J Comput Chem 13:380-397. https://doi.org/10.1002/jcc.540130311
29. Feig M, Onufriev A, Lee MS et al (2004) Performance comparison of generalized born and Poisson methods in the calculation of electrostatic solvation energies for protein structures. J Comput Chem 25(2):265-284. https://doi.org/10.1002/jcc.10378
30. Wang J, Kollman PA, Kuntz ID (1999) Flexible ligand docking: a multistep strategy approach. Proteins 36(1):1-19. https://doi.org/10.1002/(SICI)1097-0134(19990701)36:1<1::AID-PROT1>3.0.CO;2-T
31. Guterres H, Im W (2020) Improving protein-ligand docking results with high-throughput molecular dynamics simulations. J Chem Inf Model 60(4):2189-2198. https://doi.org/10.1021/acs.jcim.0c00057
32. Jones G, Willett P, Glen RC et al (1997) Development and validation of a genetic algorithm for flexible docking. J Mol Biol 267(3):727-748. https://doi.org/10.1006/jmbi.1996.0897
33. Li J, Fu A, Zhang L (2019) An overview of scoring functions used for protein-ligand interactions in molecular docking. Interdiscip Sci Comput Life Sci 11:320-328. https://doi.org/10.1007/s12539-019-00327-w
34. Repasky MP, Shelley M, Friesner RA (2007) Flexible ligand docking with Glide. In: Current protocols in bioinformatics. Wiley, New York
35. Vilar S, Cozza G, Moro S (2008) Medicinal chemistry and the molecular operating environment (MOE): application of QSAR and molecular docking to drug discovery. Curr Top Med Chem 8(18):1555-1572. https://doi.org/10.2174/156802608786786624
36. Grosdidier A, Zoete V, Michielin O (2011) SwissDock, a protein-small molecule docking web service based on EADock DSS. Nucleic Acids Res 39:W270-W277. https://doi.org/10.1093/nar/gkr366
37. Schneidman-Duhovny D, Inbar Y, Nussinov R et al (2005) PatchDock and SymmDock: servers for rigid and symmetric docking. Nucleic Acids Res 33:W363-W367. https://doi.org/10.1093/nar/gki481
38. Grosdidier A, Zoete V, Michielin O (2007) EADock: docking of small molecules into protein active sites with a multiobjective evolutionary optimization. Proteins 67(4):1010-1025. https://doi.org/10.1002/prot.21367
39. Morris GM, Huey R, Lindstrom W et al (2009) Autodock4 and AutoDockTools4: automated docking with selective receptor flexibility. J Comput Chem 30(16):2785-2791. https://doi.org/10.1002/jcc.21256
40. Allen WJ, Balius TE, Mukherjee S et al (2015) DOCK 6: impact of new features and current docking performance. J Comput Chem 36(15):1132-1156. https://doi.org/10.1002/jcc.23905
41. Trott O, Olson AJ (2010) AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem 31(2):455-461. https://doi.org/10.1002/jcc.21334
42. Eberhardt J, Santos-Martins D, Tillack AF et al (2021) AutoDock Vina 1.2.0: new docking methods, expanded force field, and python bindings. J Chem Inf Model 61(8):3891-3898. https://doi.org/10.1021/acs.jcim.1c00203
43. Hanson-Manful P, Patrick WM (2013) Construction and analysis of randomized protein-encoding libraries using error-prone PCR. Methods Mol Biol 996:251-267. https://doi.org/10.1007/978-1-62703-354-1_15
44. Siloto RMP, Weselake RJ (2012) Site saturation mutagenesis: methods and applications in protein engineering. Biocatal Agric Biotechnol 1(3):181-189. https://doi.org/10.1016/j.bcab.2012.03.010
45. Romero PA, Arnold FH (2009) Exploring protein fitness landscapes by directed evolution. Nat Rev Mol Cell Biol 10(12):866-876. https://doi.org/10.1038/nrm2805
46. Gao X, Xie X, Pashkov I et al (2009) Directed evolution and structural characterization of a simvastatin synthase. Chem Biol 16(10):1064-1074. https://doi.org/10.1016/j.chembiol.2009.09.017
47. Jiménez-Osés G, Osuna S, Gao X et al (2014) The role of distant mutations and allosteric regulation on LovD active site dynamics. Nat Chem Biol 10(6):431-436. https://doi.org/10.1038/nchembio.1503
48. Mirdita M, Ovchinnikov S, Steinegger M (2021) ColabFold - making protein folding accessible to all. bioRxiv 2021.08.15.456425. https://doi.org/10.1101/2021.08.15.456425
49. Hornak V, Abel R, Okur A et al (2006) Comparison of multiple Amber force fields and development of improved protein backbone parameters. Proteins 65(3):712-725. https://doi.org/10.1002/prot.21123
50. Lucas SJ, Kahraman K, Avsar B et al (2021) A chromosome-scale genome assembly of European hazel (Corylus avellana L.) reveals targets for crop improvement. Plant J 105(5):1413-1430. https://doi.org/10.1111/tpj.15099
51. Hekkelman ML, Te Beek TA, Pettifer SR et al (2010) WIWS: a protein structure bioinformatics Web service collection. Nucleic Acids Res 38:W719-W723. https://doi.org/10.1093/nar/gkq453
52. Sievers F, Wilm A, Dineen D et al (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7:539. https://doi.org/10.1038/msb.2011.75
53. Bloom JD, Labthavikul ST, Otey CR et al (2006) Protein stability promotes evolvability. Proc Natl Acad Sci U S A 103(15):5869-5874. https://doi.org/10.1073/pnas.0510098103
54. Subramanian K, Mitusinska K, Raedts J et al (2019) Distant non-obvious mutations influence the activity of a hyperthermophilic Pyrococcus furiosus phosphoglucose isomerase. Biomol Ther 9(6):212-218. https://doi.org/10.3390/biom9060212
55. Vieille C, Zeikus GJ (2001) Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiol Mol Biol Rev 65(1):1-43. https://doi.org/10.1128/MMBR.65.1.1-43.2001
56. Sim NL, Kumar P, Hu J et al (2012) SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res 40:W452-W457. https://doi.org/10.1093/nar/gks539




Creating De Novo Overlapped Genes 


Dominic Y. Logel and Paul R. Jaschke 


Abstract 


Future applications of synthetic biology will rely on deploying engineered cells outside of lab environments
for long periods of time. Currently, a significant roadblock to this application is the potential for deactivating
mutations in engineered genes. A recently developed method to protect engineered coding sequences
from mutation is called Constraining Adaptive Mutations using Engineered Overlapping Sequences
(CAMEOS). In this chapter we provide a workflow for utilizing CAMEOS to create synthetic overlaps
between two genes, one essential (infA) and one non-essential (aroB), to protect the non-essential gene
from mutation and loss of protein function. In this workflow we detail the methods to collect large numbers
of related protein sequences, produce multiple sequence alignments (MSAs), use the MSAs to generate
hidden Markov models and Markov random field models, and finally generate a library of overlapping
coding sequences through CAMEOS scripts. To help practitioners with basic coding skills try out the
CAMEOS method, we have created a virtual machine with all the required packages already installed
that can be downloaded and run locally.


Key words Deep learning, Machine learning, Generative model, Markov random field, Overlapping 
genes, Multiple sequence alignments, Protein design, Genome compression, Synthetic genomes, 
Synthetic biology 


1. Introduction 


Synthetic biology has led to an explosion in designed genomic parts 
driving the production of novel functions and molecules [1]. This is 
done through the construction of genetic circuits with natural or 
engineered genes controlled by regulatory elements [2]. To make 
the design of engineered genomes easier, most genome design 
approaches seek to refactor genomes to remove genetic overlaps 
and cryptic regulation [3-8]; however, this does not necessarily 
provide evolutionary stability to designs [9]. In fact, engineered 
genes and synthetic architectures often place a deleterious growth 
burden on the expression host [10-12]. Thus, hosts which have 
lost the engineered gene have a growth advantage and over time 
become the dominant population. This phenomenon has led to
ways to constrain evolution by tying the expression of the engineered
part to the expression of an essential host component, thus
linking organism survival to the retention of the engineered
component [13, 14]. Design stability is crucial as many future
applications for synthetic biology technologies are predicated on usage
outside laboratories, such as engineered nitrogen fixation in cereal
endophytes [15] and cleaning environmental pollutants [16, 17].

Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology,
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_6,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

A novel way to add genetic stability to engineered genomes is
called Constraining Adaptive Mutations using Engineered
Overlapping Sequences (CAMEOS), which seeks to emulate the
condensed and overlapped coding sequence architecture found
primarily in bacteriophage and bacteria [3, 4, 7, 18, 19]. This
computational approach uses hidden Markov models (HMMs)
and Markov random fields (MRFs) to determine protein residue
diversity at a given position, and residue-residue contacts across the
proteins, to generate overlapped coding sequences containing two
proteins [20].

The foundation for the creation of protein generative models 
(HMMs and MRFs) is a multiple sequence alignment (MSA) 
[21]. For the accurate production of these models, the MSA must 
encompass thousands to tens of thousands of sequences (Fig. 1a). 
There are multiple algorithms available for performing protein 
alignments such as ClustalW [22], FAMSA [23], and MAFFT 
[24] all of which perform differently. 

Following the creation of an MSA of the two proteins to be
co-encoded, HMMs and MRFs are generated (Fig. 1b). An HMM
operates by searching for patterns in a sequence space and calculates
the probability of a pattern, or state, occurring (e.g., G, C, A, and T
each having a 25% chance) and the transition probability of changing
states (e.g., a 75% chance of moving from state 1 to state 2). The
hidden component is the transitions between states inside an
observed sequence. The role of the HMM is to represent protein
sequence conservation across protein family members [25, 26]. An
MRF is an undirected graphical probability model and represents
combinations of independence assumptions which more directed
models, such as Bayesian modeling, cannot accurately depict
[25, 26]. The role of the MRF is to represent intra-protein
residue-residue coupling which may be crucial to protein function.
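To make the HMM vocabulary above concrete, the following is a toy sketch of the forward algorithm, which sums the probability of an observed sequence over all hidden state paths. Every number here is an illustrative assumption (state 1 reuses the 25% emission and 75% transition figures from the example in the text); it is not the profile HMM architecture built by hmmer, which uses match, insert, and delete states per alignment column.

```python
# Toy two-state HMM over the DNA alphabet; all probabilities are assumed values.
STATES = ("state1", "state2")
START = {"state1": 0.5, "state2": 0.5}
TRANSITION = {
    "state1": {"state1": 0.25, "state2": 0.75},  # 75% chance of moving 1 -> 2
    "state2": {"state1": 0.50, "state2": 0.50},
}
EMISSION = {
    "state1": {"G": 0.25, "C": 0.25, "A": 0.25, "T": 0.25},  # uniform, 25% each
    "state2": {"G": 0.40, "C": 0.40, "A": 0.10, "T": 0.10},  # GC-rich state
}


def forward_probability(sequence):
    """Return P(sequence) under the toy HMM, summed over all hidden paths."""
    # Initialise with the start distribution and the first emitted symbol.
    alpha = {s: START[s] * EMISSION[s][sequence[0]] for s in STATES}
    for symbol in sequence[1:]:
        # Each step combines every possible previous state with its transition.
        alpha = {
            s: EMISSION[s][symbol] * sum(alpha[p] * TRANSITION[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())


if __name__ == "__main__":
    print(forward_probability("GCGC"))
    print(forward_probability("ATAT"))
```

Under these assumed parameters a GC-rich sequence scores higher than an AT-rich one of the same length, which is the same principle that lets a profile HMM score how well a candidate protein sequence fits a family.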

As the HMM detects conserved direct relationships and the
MRF detects conserved indirect relationships, these models
together create a fuller picture of protein sequence and
structure in targeted proteins [26]. HMMs are used in CAMEOS
to create co-encoding solutions that are subsequently used as seeds
in a second step where long-range interactions between protein
residues are assessed with the MRFs [20].


Creating De Novo Overlapped Genes 97 


A. Building multiple sequence alignments

1. Select target coding sequences
   Website: EcoCyc
   Output File(s): proteinA.fasta and proteinB.fasta; proteins.fasta and cds.fasta

2. Download protein families from Pfam or InterPro
   Website: InterPro or Pfam
   Steps: Search sequence for protein family and download all sequences
   Output File(s): proteinA_family.fasta, proteinB_family.fasta

3. Construct multiple sequence alignment (MSA)
   Input File(s): proteinA_family.fasta, proteinB_family.fasta
   Program: MAFFT/FAMSA
   Output File(s): proteinA_alignment.msa, proteinB_alignment.msa

4. Remove outliers from MSA
   Input File(s): proteinA_alignment.msa, proteinB_alignment.msa
   Program: OD-seq
   Output File(s): proteinA_trimmed_alignment.msa, proteinB_trimmed_alignment.msa
   Note: maintain enough sequences for modelling, N/sqrt(L) > ~200, where
   N is the number of sequences in the MSA and L is the length of the protein of interest

5. Construct MSA (round 2)
   Input File(s): proteinA_trimmed_alignment.msa, proteinB_trimmed_alignment.msa
   Program: MAFFT/FAMSA
   Output File(s): proteinA.msa, proteinB.msa

B. Generating protein structure models

6. Train Hidden Markov Model (HMM)
   Input File(s): proteinA.msa, proteinB.msa
   Program: hmmer
   Output File(s): proteinA.hmm, .h3f, .h3i, .h3m, and .h3p; proteinB.hmm, .h3f, .h3i, .h3m, and .h3p

7. Train Markov Random Field (MRF)
   Input File(s): proteinA.msa, proteinB.msa
   Program: CCMpred
   Output File(s): proteinA.raw, proteinB.raw

C. Designing synthetic overlapping gene sequences

8. Convert CCMpred output to Julia compatible file type
   Input File(s): proteinA.raw and proteinB.raw
   Script: convert_ccm_to_jld.jl
   Output File(s): proteinA.jld and proteinB.jld

9. Summarize pseudolikelihoods and energies
   Input File(s): proteinA.jld, proteinB.jld
   Script: energies_and_psls.jl
   Output File(s): psls_proteinA.txt and energy_proteinA.txt; psls_proteinB.txt and energy_proteinB.txt

10. Generate gene overlap variants
    Input File(s): runfile.txt, proteinA.jld, proteinA.hmm, proteinB.jld, proteinB.hmm
    Script: main.jl | outparser.jl
    Output File(s): output/ summary_BC.csv, all_final_fitness_BC.txt, top_twelve_BC.fa, saved_pop_BC.jld, log_BC.txt, and others


Fig. 1 CAMEOS workflow. (a) The first step in the CAMEOS workflow is to create and curate MSA for the two 
target proteins. This process requires downloading protein family libraries from Pfam or InterPro, aligning 
these sequences through FAMSA or MAFFT, removing outliers via OD-seq, and repeating the alignment. (b) 
The second step is creating protein structure models (HMM and MRF) through hmmer and CCMpred. (c) The 
final step uses the CAMEOS scripts to generate the synthetic overlapping proteins and the library of the 
overlapping coding sequences 


98 Dominic Y. Logel and Paul R. Jaschke 


In this chapter, we describe the steps needed to use CAMEOS
to design de novo overlapped genes. We detail the processes to
assemble sequences and generate multiple sequence alignments,
create HMMs and MRFs, run scripts included in the CAMEOS
directory to preprocess the input data into the correct formats, and,
finally, run the CAMEOS algorithm itself (Fig. 1c).


2 Materials

2.1 Hardware

Intel Core i7-4770 3.40 GHz with 4 cores and 32 GB RAM.


2.2 Software


1. Ubuntu v20.10: https://ubuntu.com/download/desktop.

2. HH-suite (v3.3.0) [27], an open-source package for sensitive
protein sequence searching based on the pairwise alignment of
hidden Markov models (HMMs). GitHub:
https://github.com/soedinglab/hh-suite.

3. GCC v4.4+, a C compiler written for the GNU operating
system. Website: https://gcc.gnu.org/.

4. CMake v2.8+, an open-source cross-platform tool family to
build, test, and package software. Website: https://cmake.org/.

5. CCMpred [28], an open-source package for learning protein
residue-residue contacts for building Markov random fields
(MRFs). GitHub: https://github.com/soedinglab/CCMpred.

6. CAMEOS [20], an open-source package to generate de novo
overlapped sequences. GitHub:
https://github.com/BiosecSFA/CAMEOS.

7. Julia (v1.4.1), a dynamic language for technical computing.
With packages: BioAlignments, BioSymbols, Logging, StatsBase,
JLD, Distributions, ArgParse, and NPZ.

8. Python (v3.9.5), an open-source cross-platform programming
language. Website: https://www.python.org/.

9. HDF5 (v1.10.6), a data software library and file format to
manage, process, and store heterogeneous data. Website:
https://www.hdfgroup.org/solutions/hdf5/.

10. gzip (v1.10), a data compression program for the GNU operating
system. Website: https://www.gnu.org/software/gzip/.

11. hmmer v3+ [29], an open-source package for searching
biological sequence databases for homologous sequences.
GitHub: https://github.com/EddyRivasLab/hmmer.

12. zlib1g-dev (v1.2.11) and groovy, a compression deflation
method found in gzip and PKZIP. Website:
https://packages.ubuntu.com/bionic/zlib1g-dev.


13. OD-seq, an MSA analysis software package which detects
outlier sequences. Download:
http://www.bioinf.ucd.ie/download/od-seq.tar.gz.

14. FASTX-Toolkit (v0.0.14), a collection of command-line tools
for Short-Reads FASTA/FASTQ files preprocessing. GitHub:
https://github.com/agordon/fastx_toolkit.

15. MAFFT (v7.310), a multiple sequence alignment program for
Unix-like operating systems. Website:
https://mafft.cbrc.jp/alignment/software/ and
https://anaconda.org/bioconda/mafft.

16. FAMSA (v1.6.2), an algorithm for large-scale multiple
sequence alignments. Website:
https://github.com/refresh-bio/FAMSA and
https://anaconda.org/bioconda/famsa.


3 Methods

Here, we describe the overall workflow to go from a pair of proteins
we want to overlap to the output of DNA sequences that can be
synthesized and tested in a wet lab. Complementary information to
what is presented here can be found in the excellent manual.pdf
file within the original CAMEOS GitHub repository
(https://github.com/wanglabcumc/CAMEOS/tree/master/doc).
Throughout this “Methods” section, we use code that was
downloaded from a fork of the original CAMEOS GitHub repository on
1 Dec 2021 (https://github.com/BiosecSFA/CAMEOS) that
improved the original code in several ways. For details, see notes
(https://github.com/wanglabcumc/CAMEOS/pull/2). For a
more comprehensive description of the development and theoretical
underpinnings of the CAMEOS method, please see the original
publication by Blazejewski et al. [20].


3.1 Choose Protein Sequences to Overlap


The choice of which two proteins to overlap is nearly limitless, but
there will be constraints based on sequence similarity and
compatibility at the amino acid and DNA (coding sequence) level. The
CAMEOS method was originally used to generate two sets of E. coli
sequence pairs (CysJ-InfA and IlvA-CcdB), via >7500 designs. A
subset of these designs that were experimentally characterized
showed that protein function and activity were maintained in
both co-encoded proteins across their designs [20]. Additionally,
5.8 million theoretical overlaps between 199 essential genes and
49 non-essential biosynthetic gene sequences were computed.
These analyses showed that 9% of their computationally analyzed
subset contained pseudo-likelihood scores exceeding those of the
experimentally characterized sequence pairs. From this, it was inferred
that 80% of the biosynthetic genes could be encoded with at least
one essential gene. In this chapter we will use the infA (translation
initiation factor IF-1) and aroB (3-dehydroquinate synthase) E. coli
coding sequences (see Notes 1 and 2) originally included as examples
with the CAMEOS code on GitHub. All following examples
will be just for InfA, but it should be assumed that, where
appropriate, the same process must also be done for AroB
sequences. Additionally, because the AroB protein is longer, the
analyses of this protein will take longer and may require more
computational resources.


3.2 Download Target Protein and Coding Sequences


There are several sources of very large multiple sequence alignments
(MSAs) that can be used as a starting point for a CAMEOS
experiment. We will focus here on Pfam [30] and InterPro [31] 
databases, which are both large collections of protein families cre- 
ated and hosted by EMBL-EBI. 

We will use these databases as sources of large numbers of 
homologous sequences we can use to produce our own high- 
quality MSAs that are then fed into the CAMEOS workflow. 


1. To determine the minimum number of protein sequences we will
need for our alignments, we can use the approximation
N/sqrt(L) > ~200, where N is the number of sequences in the
MSA, sqrt() is the square root, and L is the length of the protein of
interest in amino acids (https://github.com/wanglabcumc/CAMEOS).
For InfA, with a length of 72 aa, the number of sequences in the
MSA would need to be N > sqrt(72) x 200, or about 1697. For AroB,
with a length of 362 amino acids, the minimum number would be
more than double, at 3805 sequences. This number of sequences
could be fulfilled from either Pfam or InterPro, but we will detail
how to download sequences from InterPro as it provides a higher number.


2. First navigate to: https://www.ebi.ac.uk/interpro/search/text/. 


3. Keyword search for “IF-1” since this is the protein encoded by
the infA gene (see Note 3).

4. Potentially more accurate searching can also be done using the
amino acid sequence of the protein of interest. In this case
you would use the “Search > By Sequence” menu option
of InterPro and enter the FASTA sequence of the protein whose
family you want to identify (Fig. 2a).


5. Click on “ACCESSION” link (IPR004368) for “Translation 
initiation factor IF-1” from InterPro under the “SOURCE 
DATABASE” heading. 


6. Click on the Proteins (46 K) header. Within this tab, further
filtering of the family can be performed to separate “reviewed”
and “unreviewed” sequences. However, as the 634 reviewed
proteins fall below the >1697 sequences required, we will
continue with the unfiltered data.


7. Click on triangle portion of blue Export button on right-hand 
side of page. 




Fig. 2 InterPro website navigation. (a) Search using the protein sequence of choice in the InterPro search bar 
to identify the protein family of the target protein. (b) After selecting the search results and navigating to the 
protein family, move to the Protein tab on the webpage, find the Export function, and click on the See More 
Download Options when hovering over the FASTA Generate button. (c) On this page, select the chosen data 
outputs and click Download 


8. Hover the cursor over the Generate button beside the FASTA
entry and you will see a popup window with “See more download
options” (Fig. 2b). Click the button.


9. On the new page that opens, the “Choose a main data type” 
header should be “Protein.” 


10. Scroll down, and under “Select Output Format” heading, 
change to “FASTA.” 


11. Scroll down to bottom of page and click the Generate button 
(see Notes 4 and 5). 


12. When the data is ready for download, the “Download” button 
will light up (Fig. 2c). Click this button and name the file 
infA.fasta. 
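The rule of thumb from step 1 can be checked with a few lines of Python before any sequences are downloaded. This is only a sketch of the arithmetic; the function name min_msa_sequences is hypothetical and the ratio of 200 is the approximate value quoted above.

```python
import math


def min_msa_sequences(protein_length: int, ratio: float = 200.0) -> float:
    """Return the value that N must exceed so that N / sqrt(L) > ratio."""
    return ratio * math.sqrt(protein_length)


# Protein lengths (in amino acids) quoted in the text.
for name, length in (("InfA", 72), ("AroB", 362)):
    print(name, length, round(min_msa_sequences(length)))
```

For the two example proteins this reproduces the thresholds given in step 1 (about 1697 and 3805 sequences).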


3.3 Gathering Additional Sequences with HHblits

While InterPro and Pfam are excellent resources for downloading
the sequences of protein families, sometimes more sequences are
required to train the protein models than are provided by these
sources. A way of gathering additional sequences is using the
HHblits tool within HH-suite, which iteratively searches sequences
to detect similarities and build high-quality MSAs [27]. There are
two methods for using HHblits for gathering aligned sequences:
either using the HHblits command hhblits on a CLI or via the
HHblits webtool.

The HHblits webtool (https://toolkit.tuebingen.mpg.de/tools/hhblits)
will be discussed first as it is the simpler approach,
although it offers fewer user input options. The tool requires a
single protein sequence of interest or an MSA as the starting point.
Additionally, the user is able to specify the following search
parameters: (1) the Expect (E) value cutoff for inclusion, (2) the
number of HHblits search iterations performed, (3) the minimum
probability in the hitlist, and (4) the maximum number of target
hits. All modification options are available within a dropdown
menu, and the default settings are clearly noted. Within a few
minutes of submission, HHblits will return results listing the
(1) number of hits, (2) their identity, and (3) the alignment of
those sequences (Fig. 3a). For the downstream processes, the user
must navigate to the “Query Template MSA” tab and download
the full MSA file by clicking the option “Download Full MSA.” The
downloaded file is in the .a3m format (protein.a3m) and can be
used with MAFFT without conversion to a protein.msa file
(Fig. 3b).

The other option to gather more sequences using the HHblits
algorithm is to download the HH-suite package from GitHub
directly (see Note 6) and run hhblits from a CLI. The associated
manual on hhsuite is very detailed and easy to use; however, the
tool requires a large downloaded sequence database (50+ GB) to
function.


3.4 Gathering Additional Sequences with PSI-BLAST


A complementary approach to the protein domain-focused databases
InterPro and Pfam, and the search tool HHblits, is the
algorithm Position-Specific Iterated BLAST, or PSI-BLAST.
PSI-BLAST is a publicly available database search tool hosted by NCBI
which performs an iterative search with a protein query.
Detailed instructions for PSI-BLAST are on the NCBI website and
in this reference [32]. PSI-BLAST has some advantages over the
previously mentioned tools as it provides increased sequence
coverage at the cost of lower sequence identity. This is important as
building the HMM will require a low number of gaps to generate an
HMM profile that is usable with CAMEOS.


Fig. 3 HHblits webserver outputs. (a) Once HHblits has queried the user input sequence, the server generates a
visualization output aligning returned sequences to the input sequence within the Results Table. (b) The MSA (in the
.a3m format) for the sequence search is accessed via the Query Template MSA tab and can be downloaded
in Reduced or Full formats


PSI-BLAST is an easy-to-use tool requiring a single protein
sequence input. To use the PSI-BLAST function, users navigate to
the protein BLAST (blastp) on NCBI, enter the input protein in
the Query Sequence section, and select the PSI-BLAST algorithm
in the Program Selection section. From the original input,
PSI-BLAST will generate search results matching the input protein
limited either by sequence counts (default = 500) or by E value
cutoff (default is 0.005). These settings can be modified in the
algorithm parameters on the initial search page to expand or
constrict the available results from the first iteration. The initial results
are used to seed the second iteration, which is controlled by
selecting the number of sequences to add in the “Run PSI-BLAST
iteration 2” input.

Search results can be filtered afterward by percentage identity, E
value, query coverage, and threshold cutoffs. Further iterations
can be performed to expand the sequence counts. All results can
be downloaded from the browser as either aligned or unaligned
sequences in FASTA format.


3.5 Perform Multiple Sequence Alignment Using MAFFT


The original CAMEOS publication used the FAMSA aligner [23] 
in concert with OD-seq [33] to remove outliers >2 standard devia- 
tions away from the sequences in a dataset, followed by manual 
removal of alignment positions when less than 50% of entries were 
aligned amino acids. Alternatively, we present another method 
below using the MAFFT aligner that seems to produce comparable 
results with less manual intervention. 
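The manual pruning rule described above (removing alignment positions where fewer than 50% of entries are aligned amino acids) can be sketched in a few lines of Python. This is an illustrative stand-in under stated assumptions, not the procedure used in the original publication: the function name drop_sparse_columns is hypothetical, and both "-" and "." are assumed to be gap characters.

```python
def drop_sparse_columns(rows, min_occupancy=0.5, gap_chars="-."):
    """Keep alignment columns in which at least min_occupancy of rows hold a residue."""
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same length")
    keep = [
        col for col in range(width)
        if sum(row[col] not in gap_chars for row in rows) / len(rows) >= min_occupancy
    ]
    return ["".join(row[col] for col in keep) for row in rows]


# Toy alignment: the third column is a residue in only 1 of 4 rows.
toy = ["MK-A", "MR-A", "M-LA", "MK-A"]
print(drop_sparse_columns(toy))
```

Note that pruning columns this way can remove positions present in the target protein, so the resulting alignment length should still be compared with the length of the protein of interest.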


1. Because the MAFFT aligner is not as efficient as FAMSA with 
large numbers of sequences (>10,000), it may be necessary to 
take a subsample of the files obtained from InterPro. The 
fasta-subsample tool in the MEME Suite [34] is an easy 
way to do this. Because of interference between different tools 
used in this protocol, we used the Conda environment and 
package manager [35] to create a new environment just to 
run the MEME Suite. All tools we use in this protocol were run
within the Ubuntu Linux operating system. After starting up
a CLI:


(base) $ conda activate meme 


Then from within that Conda environment where MEME 
suite has been installed, you can use the fasta-subsample 
tool. Navigate to the folder containing the infA.fasta file 
then use the CLI: 


(meme) $ fasta-subsample infA.fasta 10000 >infA_sub_10000.fasta

where

10000                   The number of sequences you wish to subsample
infA.fasta              The FASTA file containing more sequences than you
                        require
>infA_sub_10000.fasta   A redirection that stores the results of the subsampling
                        in a file called infA_sub_10000.fasta

You can check that there are actually 10,000 FASTA sequences within
this newly created file by using the grep command and piping the results
to the wc command:

(meme) $ grep -o '>' infA_sub_10000.fasta | wc -l
10000


2. Run the MAFFT tool within our (base) environment on the 
newly created subsampled FASTA file of InfA sequences. 


(base) $ time mafft --add infA_sub_10000.fasta --keeplength infA.fasta >infA.msa

We use the time command to show us how long the
alignment process took after it has completed. The --add
flag is normally used to add unaligned full-length sequence(s) into
an existing alignment. In this case we are not adding to an
alignment but using the subsampled InfA sequences in the file
infA_sub_10000.fasta. This is done so that the alignment
does not contain too many gaps and is relative to the target
sequence. The --keeplength flag is used to chop off the parts
of alignments that go beyond the E. coli InfA target sequence
(infA.fasta), which simulates the effects of manual pruning
(see Note 7).


3. The sequence alignment may contain outlier sequences that 
would reduce the accuracy of the CAMEOS designs. To 
remove outlier sequences from the MSA, we will use OD-seq 
[33] on the alignment (see Note 8). 


$ OD-seq -s 2 -i infA.msa -c infA_trim.msa

where

OD-seq   The program that removes outlier sequences.
-s       Flag to specify the number of standard deviations from the mean
         needed for a sequence to be removed from the alignment (in this
         case two standard deviations are selected).
-i       Flag to specify input file name (infA.msa).
-c       Flag to specify the output file name for sequences with average
         distance of less than two standard deviations to the rest of the
         sequences in the alignment (infA_trim.msa).
-o       (optional) Flag to specify the output file name for sequences with
         average distance of more than two standard deviations to the rest
         of the sequences in the alignment.

4. The MAFFT alignment is then repeated to align the sequences
that were not removed by OD-seq.

5. The FASTA formatted output of MAFFT (and FAMSA) is not
directly compatible with the CAMEOS scripts, so it must be
converted to a single-line FASTA format. For CAMEOS to
recognize the MSA files, each sequence, including gap characters
(-), must occupy only one line (see Note 9). We will use
the fasta_formatter command of FASTX-Toolkit (see
Note 10) to do this:

$ fasta_formatter -i infA_trim.msa -o infA.msa -w 0

where

fasta_formatter   Tool used to reformat FASTA sequences.
-i                Flag to specify input file name (infA_trim.msa).
-o                Flag to specify output file name (infA.msa).
-w                Flag to specify the max. sequence line width for the output
                  FASTA file. The 0 means that sequence lines will not be
                  wrapped and all amino acids will appear on the same line.


3.6 Creation of a Protein Generative Model

Before two protein sequences can be artificially overlapped with the 
CAMEOS algorithm, each protein sequence must be analyzed to 
create both an HMM and an MRF representation. This is done so that
the CAMEOS algorithm can determine regions of the proteins 
where sequence flexibility and long-range interactions (residue- 
residue contacts) are amenable to coding sequence overlap in dif- 
ferent reading frames. Unless otherwise stated, we assume the 
proteinA.msa and proteinB.msa files (in our case infA.msa 
and aroB.msa) are within the main/ subfolder of the CAMEOS 
script folder and your CLI program’s present working directory 
(pwd) is also main/. 
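Before building the models, it can help to confirm that each MSA file is in the form the later steps expect. The sketch below checks three things discussed in this chapter: every sequence occupies a single line, every aligned row has the length of the target protein, and the number of sequences satisfies the N/sqrt(L) > ~200 guideline. The function name check_msa and the wording of the messages are hypothetical; the checks are a convenience under these assumptions and are not part of CAMEOS.

```python
import math


def check_msa(text: str, target_length: int, ratio: float = 200.0) -> list:
    """Return a list of problems found in single-line FASTA alignment text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    problems = []
    if len(lines) % 2 != 0:
        problems.append("odd number of lines: a sequence is wrapped or missing")
    headers, seqs = lines[0::2], lines[1::2]
    if any(not h.startswith(">") for h in headers) or any(s.startswith(">") for s in seqs):
        problems.append("headers and sequences do not alternate line by line")
    bad = sum(1 for s in seqs if not s.startswith(">") and len(s) != target_length)
    if bad:
        problems.append(f"{bad} row(s) differ from the target length {target_length}")
    if len(seqs) <= ratio * math.sqrt(target_length):
        problems.append("fewer sequences than the N/sqrt(L) > ~200 guideline")
    return problems


# Toy input; the ratio is lowered only so that two sequences pass the size check.
toy = ">a\nMK-A\n>b\nMRLA\n"
print(check_msa(toy, target_length=4, ratio=0.5))
```

For the example in this chapter, the text would be read from infA.msa and target_length would be 72, with the default ratio left unchanged.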


3.6.1 Training HMM 
Using Hmmer 


1. Using our generated infA.msa and aroB.msa files, we will
first generate hidden Markov models (HMMs) of each protein
using the hmmer (http://hmmer.org/) command hmmbuild.

$ hmmbuild infA.hmm infA.msa

where

hmmbuild   The function of hmmer that builds an HMM
infA.hmm   Output file name
infA.msa   Input file name


2. Inspect the resulting information generated in the CLI to
determine if the HMM was generated correctly. The CAMEOS
script is able to use .hmm files that are incomplete without
raising an error, but the end results of the process will be
incorrect. An indication that your HMM files are not correct
is that the final engineered protein sequences will include
gaps and will be shorter than the actual protein sequences put
into the script in the proteins.fasta file.

To ensure these errors do not occur, your HMM must be
of the same length as the input protein. In the hmmbuild
output, check that “alen” and “mlen” are the same value
(72 in this case) and that this value is the same as the length
of the protein in amino acids (72 aa in this case is the full length
of InfA).


hmmbuild :: profile HMM construction from multiple sequence 
alignments 

HMMER 3.3 (Nov 2019); http://hmmer.org/ 

Copyright (C) 2019 Howard Hughes Medical Institute. 

Freely distributed under the BSD open source license. 

input alignment file: slyD.msa 

output HMM file: slyD.hmm 


idx name nseq alen mlen eff_nseq re/pos description 


1 infA 1586 72 72 0.52 0.592 


3. The generated .hmm file is then compressed to create the final
.hmm/.h3f/.h3i/.h3m/.h3p files CAMEOS requires, using
the hmmpress command of the hmmer package.

$ hmmpress infA.hmm

where

hmmpress   The function of hmmer that prepares an HMM database
infA.hmm   Input file name
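The length check from step 2 can also be made on the saved profile rather than on the screen output, because a HMMER3 ASCII profile records its model length on a LENG line in the file header. The sketch below reads that value from the text of a .hmm file; the function name hmm_model_length is hypothetical, and the header shown is an abbreviated, hand-written example rather than real hmmbuild output.

```python
def hmm_model_length(hmm_text: str) -> int:
    """Return the LENG value from the header of a HMMER3 ASCII profile."""
    for line in hmm_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "LENG":
            return int(fields[1])
        if fields and fields[0] == "HMM":  # the header ends where the model body begins
            break
    raise ValueError("no LENG line found in the HMM header")


# Abbreviated header written by hand for illustration only.
toy_header = "HMMER3/f [3.3 | Nov 2019]\nNAME  infA\nLENG  72\nALPH  amino\n"
protein_length = 72  # length of InfA in amino acids, as given in the text
print(hmm_model_length(toy_header) == protein_length)
```

In practice the text would come from reading infA.hmm, and a mismatch between the two lengths would suggest returning to the alignment steps before running CAMEOS.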
3.6.2 Training Markov Random Field Using CCMpred

The MRF model for each protein will be trained using CCMpred
[28] to create residue-residue contact predictions. These models
are for later use in assessing the impact of protein sequence changes
and their long-range interactions within a protein family.


1. First we must convert the MSA files to a format that is
compatible with CCMpred (only sequences, no FASTA headers) using
an inverse-match grep command:

$ grep -v ">" infA.msa > infA.ccm

where

grep       A command-line utility for searching lines
           that match a regular expression
-v         Inverse-match flag
">"        The string to match in the input file
infA.msa   Input file
>          Save results of grep to a file
infA.ccm   Output file name

2. Next we invoke CCMpred to generate a .raw matrix file.

$ ccmpred -t 1 -r infA.raw -n 100 infA.ccm infA.mat

where

ccmpred    The ccmpred command.
-t 1       (optional) Depending on the number of CPUs you have available
           for the computation, you may want to use a value greater than
           1 (the default) here to complete the calculation faster.
-r         Flag to store the raw prediction matrix in RAWFILE.
infA.raw   The output raw file.
-n 100     Compute a maximum of NUMITER operations [default: 50].
infA.ccm   The input file name.
infA.mat   The matrix output file.

3. The .raw file is not in the correct format expected by the
CAMEOS scripts, so it must be converted to an internal
MRF file format using a Julia [36] language script
(convert_ccm_to_jld.jl). The script generates a Julia Data
File (.jld) that is used when the main.jl CAMEOS Julia
script is run later:

$ julia convert_ccm_to_jld.jl infA.raw infA.jld

where

julia                   Runs script with Julia
convert_ccm_to_jld.jl   Name of script to run
infA.raw                Input file name
infA.jld                Output file name

The .jld files must then be transferred to the jlds/ subfolder
or the main.jl script will fail when run later.


3.6.3 Summarizing Pseudo-Likelihoods/Energies


1. Next we must summarize the data from CCMpred into formats
that work with the CAMEOS scripts. The pseudo-likelihoods
and energies of the proteins are used in the optimization process,
and so these values are calculated using the
energies_and_psls.jl script, which will output two files,
psls_protein.txt and energy_protein.txt, into the
psls/ and energies/ subfolders, respectively. These folders
must already exist within the main/ folder, or the main.jl
script will fail when run. In our example, the files would be
named psls_infA.txt and energy_infA.txt.

The script is run:

$ julia energies_and_psls.jl infA infA.jld infA.msa

where

julia                  Runs script with Julia.
energies_and_psls.jl   Name of script to run.
infA                   The name of the protein.
infA.jld               Input Julia Data File name. Note, this was
                       generated in the previous step.
infA.msa               Input MSA file name. Note, this was
                       generated in a previous step.


3.6.4 Setting Up Folder Structure


The main.jl script requires all the input files to be in a certain
folder structure or it will fail. Within the main/ folder, the correct
subfolder structure is:

energies/
Containing energy_infA.txt and energy_aroB.txt files.

hmms/
Containing infA.hmm, infA.hmm.h3f, infA.hmm.h3i,
infA.hmm.h3m, infA.hmm.h3p, aroB.hmm, aroB.hmm.h3f,
aroB.hmm.h3i, aroB.hmm.h3m, and aroB.hmm.h3p files.

jlds/
Containing infA.jld and aroB.jld.

NOTE: as of the time of this writing (early 2022), GitHub does
not host these files correctly, and instead of aroB.jld being
~464.9 MB and infA.jld being ~18.3 MB, they instead are
134 bytes and 133 bytes, respectively. The reduced-size files will
cause an error if used as is. There are two options to get around this
limitation of GitHub. The first is to download the correct files here:
https://cloudstor.aarnet.edu.au/plus/s/jpMOfvlyOY2r4Wi.

Alternatively, if you run the CAMEOS process from the beginning,
as described in this chapter, you will generate your own
aroB.jld and infA.jld files of the correct size.

msas/
Containing infA.msa and aroB.msa.

output/
Containing nothing. The folder is empty at the start of the
script.

psls/
Containing psls_infA.txt and psls_aroB.txt.

Additionally, within the main/ folder, the following files are
required:

cds.fasta, proteins.fasta, runfile.txt


3.6.5 Running CAMEOS


1. The last step before running the main CAMEOS script is to
modify the file containing the run parameters. Here, we call the
file runfile.txt, but you can name it whatever makes sense
to you. The file contains the parameters used during the main.jl
script execution and controls aspects of the CAMEOS
process, such as how many seeds to optimize and how many
iterations to perform. These parameters must be adjusted carefully
because they can have dramatic effects, such as significantly
extending runtime. The runfile.txt parameter file
has the following structure:

output/ infA aroB jlds/infA.jld jlds/aroB.jld hmms/infA.hmm hmms/aroB.hmm 100 p1 250

The file is tab-delimited (each unit of text is separated by a tab
character) and stores the following information (see Note 11),
where:

output/         Directory where output files are stored.
infA            Gene/Protein A name.
aroB            Gene/Protein B name.
jlds/infA.jld   Path to jld file containing MRF (CCMpred data) for Gene A.
jlds/aroB.jld   Path to jld file containing MRF (CCMpred data) for Gene B.
hmms/infA.hmm   Path to HMM file (hmmer data) for Gene A.
hmms/aroB.hmm   Path to HMM file (hmmer data) for Gene B.
100             Number of seeds to optimize.
p1              Frame. This parameter should not be modified.
250             Number of iterations of algorithm.
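Because the parameter file must be tab-delimited and must point at files that already exist, it can be convenient to generate it and check its paths programmatically. The following is a minimal sketch under stated assumptions: the helper names write_runfile and missing_inputs are hypothetical, the field order follows the table above, and the list of required subfolders and files follows the folder structure described in the previous subsection.

```python
import tempfile
from pathlib import Path


def write_runfile(path, out_dir, gene_a, gene_b, seeds=100, frame="p1", iterations=250):
    """Write a one-line, tab-delimited parameter file and return its fields."""
    fields = [
        out_dir, gene_a, gene_b,
        f"jlds/{gene_a}.jld", f"jlds/{gene_b}.jld",
        f"hmms/{gene_a}.hmm", f"hmms/{gene_b}.hmm",
        str(seeds), frame, str(iterations),  # frame is copied from the example run file
    ]
    Path(path).write_text("\t".join(fields) + "\n")
    return fields


def missing_inputs(main_dir, fields):
    """List required subfolders and referenced files that are absent under main_dir."""
    root = Path(main_dir)
    required = [fields[0], "energies", "psls", "msas", "cds.fasta", "proteins.fasta"]
    required += fields[3:7]  # the two .jld paths and the two .hmm paths
    return [item for item in required if not (root / item).exists()]


# Demonstration in an empty temporary folder, so everything is reported missing.
with tempfile.TemporaryDirectory() as main_dir:
    fields = write_runfile(Path(main_dir) / "runfile.txt", "output/", "infA", "aroB")
    print(missing_inputs(main_dir, fields))
```

Run against a real main/ folder, an empty list from missing_inputs would indicate that the paths named in the run file and the expected subfolders are present; it does not verify that the files themselves are valid.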


The time the CAMEOS method takes to complete depends on
a number of factors such as sequence lengths, seed value, and
iteration value.

To run CAMEOS using the main.jl script, navigate to the
main/ folder and type:

$ julia main.jl runfile.txt

where

julia         Runs script with Julia
main.jl       The script to run
runfile.txt   A file containing the parameters used during the CAMEOS run

After a successful run, text will be displayed on the CLI,
similar to:

Running CAMEOS using parameters specified in runfile.txt
CAMEOS parameters are:
output/ infA aroB jlds/infA.jld jlds/aroB.jld hmms/infA.hmm hmms/aroB.hmm 100 p1 250
The random barcode on this run is: G8wCDktg
CAMEOS tensor built
Evaluating HMM seeds
Beginning long-range optimization.
Step 0 of 250...
Step 50 of 250...
Step 100 of 250...
Step 150 of 250...
Step 200 of 250...
1038.885614 seconds (262.57 M allocations: 310.487 GiB, 1.38% gc time)


3.6.6 Evaluating CAMEOS Results


1. 


In the output/ subfolder, a number of files are created froma 
successful CAMEOS run. The top_twelve_BC.fa (where 
BC is barcode of the run; in our example above, it would be 
top_twelve_G8wCDktg.fa) file contains the best three 
co-encodings of the two genes of interest from the best score 
of protein A (InfA in our example) and protein B (AroB in our 
example). Additionally, the file also contains co-encodings 
(CDS overlaps) with the best overall score. 

Although in most instances the script will simply fail if the
initial files are not of the correct type and location, we have
seen a few cases where output is generated but is erroneous. For
example, if the MSA files that are used have more characters in
them than the sequences in the proteins.fasta file, the output
sequences that are generated will have gaps in them. Therefore,
the output sequences should be analyzed carefully before
synthesizing the DNA to make the constructs.
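The check described above can be partly automated. The following is a minimal Python sketch, not part of CAMEOS, that reads FASTA-formatted text, lists any records containing gap characters, and reports the percentage of positions at which a design differs from the wild-type sequence. The function names and the choice of "-" and "." as gap symbols are assumptions made for illustration, and the comparison assumes the two sequences are already aligned and of equal length.

```python
from typing import Dict, List

GAP_CHARS = frozenset("-.")  # assumed gap symbols; adjust to your aligner


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            # Sequence lines are concatenated, so wrapped records are handled.
            records[name] += line
    return records


def gapped_records(records: Dict[str, str]) -> List[str]:
    """Return the headers of sequences that contain a gap character."""
    return [n for n, s in records.items() if any(c in GAP_CHARS for c in s)]


def percent_altered(wild_type: str, design: str) -> float:
    """Percentage of positions that differ between two aligned sequences."""
    if len(wild_type) != len(design) or not wild_type:
        raise ValueError("sequences must be non-empty and of equal length")
    changed = sum(1 for a, b in zip(wild_type, design) if a != b)
    return 100.0 * changed / len(wild_type)
```

In use, the text of an output FASTA file would be passed to read_fasta, and any header returned by gapped_records would be inspected by hand before ordering DNA.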

2. An additional script (from: https://github.com/BiosecSFA/CAMEOS)
can be used to summarize the information from the jld output
file into FASTA and comma-separated values (CSV) files, which
are generally easier to look through.

With the CLI's present working directory as main/, execute the
following code:


$ julia outparser.jl infA aroB BC --fasta


where

julia            Runs script with Julia.
outparser.jl     Julia script that parses the CAMEOS output into easier to analyze files.
infA             The name of the first protein co-encoded.
aroB             The name of the second protein co-encoded.
BC               Barcode from your CAMEOS run (e.g., G8wCDktg).
--fasta          Generates a FASTA file of the results in addition to a CSV.
--just-fullseq   (optional): this flag can be used in addition to the --fasta flag to
                 create a FASTA file that only contains the full sequence.


3. In the example of infA encoded within the aroB gene, we can see
that the top scoring hits are located in either the 5' or 3'
regions of aroB (Fig. 4a). Within the 5' region, three InfA
variant designs modified the AroB sequence on average 14% to
enable the co-encoding of InfA into AroB. A similar result was
observed in the 3' region of aroB as the five InfA variants
there modified AroB on average 13%. Despite being in two
distinct regions, all designs incorporated a new residue at
position 30. In most designs, this was lysine; however, in
designs 1 and 20, an arginine and tryptophan were incorporated,
respectively (Fig. 4b). While no crystal structure is available
for E. coli AroB, an AlphaFold prediction of the protein
structure is publicly available on UniProt [37, 38]. The
inserted AroB residue is incorporated 5' adjacent to a proline
residue which terminates a predicted alpha helix. All three
modified residues have some favorability to form alpha helices;
therefore, it is likely that the modification either continues
the alpha helix by one residue or has a secondary structural
effect. Overall, AroB is predicted to be a highly structured
protein; therefore, all modified residues will interact with
existing secondary structures. For example, in the 5' region
containing InfA designs, the existing structure is a combination
of alpha helices and beta sheets, while the 3' region containing
InfA designs is populated with predominantly alpha helices with
a single beta sheet. Due to its short size, InfA had more
significant modifications to its sequence as on average 45% of
the amino acid identities were altered (Fig. 4c). Unlike AroB,
InfA has an experimentally characterized structure which is
dominated by a beta barrel with a single short alpha helix
[39]. Therefore, due to InfA's small size and highly structured
topology, all residue changes would be members of an existing
beta sheet or alpha helix.

4. Differences in the co-encodings can also be seen when
considering the predicted translation efficiencies of infA from
within the aroB sequence. Using the RBS Calculator [40] on the
top five designs, we see a nearly eightfold difference between
the predicted translation efficiency of the worst and best infA
designs. Similarly, there is a ninefold difference for aroB
(Fig. 5a). The correlation between aroB and infA translation
initiation rates in this case is due to the N-terminal location
of the infA co-encoding. If infA is co-encoded more
C-terminally, the two CDS translation initiation rates (TIRs)
are not connected, with aroB displaying a strong TIR (5577 AU)
and infA displaying a range of TIRs (1-286 AU), which are 20- to
5577-fold lower than that of aroB (Fig. 5b). For reference, in
its natural E. coli genomic context, infA has a predicted TIR of
1211 AU, which is at least fourfold higher than the best CAMEOS
encoding.


[Figure not reproduced: panels (A)-(C) show the wild-type AroB and
InfA sequences aligned with designs 1, 13, 41, 18, 32, 38, 20, and 89]

Fig. 4 Aligned CAMEOS AroB and InfA outputs. (a) The eight highest scoring CAMEOS designs incorporated
InfA within AroB within two regions, at the 5' and 3' ends. (b) AroB designs aligned to the wild-type AroB
sequence showing the locations where residues were modified. (c) InfA designs aligned to the wild-type InfA
sequence showing the locations where residues were modified


3.7 Putting It All Together


1. As we have outlined, the successful completion of co-encoding
two proteins in the same DNA sequence in different reading
frames is a complex and multistep process. One of the most
burdensome barriers to entry for molecular biologists is access
to a computer running Linux and the installation of all the
tools needed. To ease this process and enable scientists without
strong computational backgrounds to use the CAMEOS algorithm, we
have created a virtual machine which comes pre-loaded with all
the tools used in this protocol and can be easily run on a
computer using Windows or macOS operating systems. The virtual
machine file is ~20 GB in size, so ensure you have plenty of
room on your disk and a fast internet connection. The disk size
of the virtual machine is 50 GB, so as you add data to your
virtual machine, ensure you have at least 75 GB free on your
computer's disk. Access the .ova file which contains the virtual
machine (called "overlap") here:
https://cloudstor.aarnet.edu.au/plus/s/800J]7SHTt463KE5.


The starting password for the Ubuntu operating system on the
virtual machine is overlap. We strongly suggest you change the
password once you start using the virtual machine.


[Figure not reproduced. Panels (a) and (b): infA TIR plotted against
aroB TIR; panel (a) carries a fitted line labelled y = 0.042x - 2.2]

Fig. 5 Predicted translation efficiencies for CAMEOS designs. (a) Using RBS Calculator on the top five
N-terminal designs of infA/aroB overlap, we see a strong correlation between aroB translation initiation rate
(TIR) and the downstream infA TIR. This effect is likely due to the close proximity of the start codons and
ribosome binding sites of the co-encoded coding sequences. (b) Using RBS Calculator on other aroB/infA
co-encodings where the start codons and RBSs are spaced further apart shows no correlation between TIRs
but does show, as before, a wide range of infA TIRs

To check if the overlap.ova file is downloaded correctly on
macOS, start up a CLI such as Terminal in the directory you
downloaded the file to and run:


$ md5 overlap.ova


To check if the overlap.ova file is downloaded correctly on
Linux, start up a CLI such as Terminal in the directory you
downloaded the file to and run:


$ md5sum overlap.ova


To check if the overlap.ova file is downloaded correctly on
Windows, start up a CLI such as Command Prompt in the directory
you downloaded the file to and run:


C:\> certutil -hashfile overlap.ova MD5


The result of these commands should be:

b8f894df3305507bdee6e992ac87d75f


If your result does not match, then it is likely that the download
was interrupted and the overlap.ova file was corrupted. Please
try to download again. In the future, if the link to this resource
becomes broken, please check our lab website for details:
https://www.jaschke-lab.science/
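As a platform-independent alternative to the three commands above, the same comparison can be sketched in Python using the standard hashlib module. This is an illustrative helper, not part of the protocol's tooling; the function names are hypothetical, and the expected checksum must be supplied by the user from the value quoted in the text.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks avoids loading a multi-gigabyte file into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with an expected value, ignoring case."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Calling matches_expected("overlap.ova", "<checksum quoted above>") from the download directory returns True only when the file is intact; a False result suggests the download should be repeated.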

The virtual machine .ova file (overlap.ova) can be booted up
using the free Oracle VM VirtualBox software.


2. We have also created a script in the shell language Bash that can
be used to accomplish all the previously described steps
automatically, reducing the chances of human error from moving all
these files around and using tools with certain parameters. This
Bash script is called run_cameos.sh and is stored within the
main/ folder of the CAMEOS code on the overlap.ova virtual
machine. To just download the Bash script, please find it here:
https://cloudstor.aarnet.edu.au/plus/s/QOIKhQAQCaNimdj


To use the script to perform a run of CAMEOS, open the
run_cameos.sh script in a text editor and set the two variables
that specify the names of the proteins you are working on:


proteinA=infA
proteinB=aroB


Then save and close the run_cameos.sh script file. Next,
open a CLI in the main/ folder and run the script by:


$ bash run_cameos.sh


As the script is running, it will display updates on which tool is
being run and its progress in the CLI window. Once the run is
done, it will display information on where the output files are
located and how long the script took to run.


4 Notes


1. More information on these coding sequences can be seen on
the EcoCyc database [41] here:
https://ecocyc.org/gene?orgid=ECOLI&id=EG10504 and
https://biocyc.org/gene?orgid=ECOLI&id=EG10074.


2. Only E. coli sequences have been used with CAMEOS before.
In principle, nothing prevents coding sequences from other
prokaryotes from being used, but for any sequence that differs
from the standard codon table or from E. coli codon usage, the
code would need to be adapted manually.


3. Text to enter will be supplied within single quotes ('text to be
entered'), and the quotes should not be included unless
specifically stated.


4. Depending on the number of sequences, this process may take 
more than 1 h to generate the data. 


5. The FASTA sequences could also be downloaded programma- 
tically using the available Application Programming Interface 
(API) using Python, Perl, or JavaScript. InterPro makes this 
process easier by automatically generating the code needed, but 
this method is outside the scope of the current article. 


6. The HH-suite GitHub page has a detailed wiki page with 
examples of how to run their script and is accessible via 
https://github.com/soedinglab/hh-suite/wiki. 

7. https://mafft.cbrc.jp/alignment/server/add.html.


8. Although available through Bioconductor for the R language,
we used the CLI version available here from the original
publication: http://www.bioinf.ucd.ie/download/od-seq.tar.gz.


9. Some multiple sequence aligners (e.g., MAFFT and FAMSA) 
create output with 50 or 80 characters per line separated by 
newline (\n) characters, which is not suitable for use in the 
CAMEOS script. 
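The line-wrapping issue in Note 9 can be handled with a small reformatting step. The sketch below, in Python, joins the wrapped sequence lines of each FASTA record so that every sequence occupies a single line. It is an illustration only: the function name is hypothetical, and the assumption that one unbroken line per sequence is what the CAMEOS script expects is inferred from the note rather than taken from the CAMEOS documentation.

```python
def unwrap_fasta(text: str) -> str:
    """Rewrite FASTA text so that each record's sequence is on one line."""
    output_lines = []
    sequence_parts = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if sequence_parts:
                output_lines.append("".join(sequence_parts))
                sequence_parts = []
            output_lines.append(line)
        else:
            sequence_parts.append(line)
    if sequence_parts:
        output_lines.append("".join(sequence_parts))
    return "\n".join(output_lines) + ("\n" if output_lines else "")
```

Gap characters are left untouched, so aligned columns are preserved; only the newline characters inside each sequence are removed.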


10. The FASTX-toolkit is available for download through both
GitHub (https://github.com/agordon/fastx_toolkit) and
through a website (http://hannonlab.cshl.edu/fastx_toolkit/).


11. The names used for proteins A and B must be identical to those
used in the proteins.fasta and cds.fasta files in the
FASTA header (e.g., >infA). The file names of the files storing
the energies and pseudo-likelihoods must also include the same
identifier (e.g., energy_infA.txt and psls_infA.txt).
Lastly, use a short identifier, such as the three- to four-letter
gene name, because output files and folders will be named with
these identifiers. For reference, the CAMEOS publication
(Figure S2) showed the effects of using different values for
number of iterations. Most of the tested variants did not
display dramatically improved summed pseudo-likelihood scores
after ~300 iterations. The Frame value p1 should not be altered
as it is the only option currently supported.


Acknowledgments


We recognize that the intellectual and physical labor of this research
was conducted on the traditional lands of the Wallumattagal clan of
the Dharug nation and the Gadigal, Wangal, and Cammeraygal
peoples of the Eora nation. We thank G Sullivan for assistance
with FASTX-toolkit; T Blazejewski for explaining in detail to us
how CAMEOS works; and JM Marti, J Allen, Y Jiao, D Park, and D
Ricci for helpful discussions and their improvements to the original
CAMEOS code. This research was conducted by the Australian
Research Council Centre of Excellence in Synthetic Biology
(project number CE200100029) and funded by the Australian
Government. DYL was supported by the Macquarie University COVID
Recovery Postdoctoral Fellowship. PRJ was supported by
NHMRC Ideas Grant APP1185399.


References


1. Tang T-C et al (2021) Materials design by synthetic biology.
    Nat Rev Mater 6(4):332-350
2. Brophy JAN, Voigt CA (2014) Principles of genetic circuit design.
    Nat Methods 11(5):508-520
3. Chan LY, Kosuri S, Endy D (2005) Refactoring bacteriophage T7.
    Mol Syst Biol 1(1):2005.0018
4. Jaschke PR et al (2012) A fully decompressed synthetic
    bacteriophage øX174 genome assembled and archived in yeast.
    Virology 434(2):278-284
5. Logel DY, Trofimova E, Jaschke PR (2022) Codon-restrained method
    for both eliminating and creating intragenic bacterial promoters.
    ACS Synth Biol 11(2):689-699
6. Temme K, Zhao D, Voigt CA (2012) Refactoring the nitrogen fixation
    gene cluster from Klebsiella oxytoca. Proc Natl Acad Sci
    109(18):7085-7090
7. Wright BW, Molloy MP, Jaschke PR (2022) Overlapping genes in
    natural and engineered genomes. Nat Rev Genet 23(3):154-168
8. Song M et al (2017) Control of type III protein secretion using a
    minimal genetic system. Nat Commun 8:14737
9. Springman R et al (2012) Evolutionary stability of a refactored
    phage genome. ACS Synth Biol 1(9):425-430
10. Borkowski O et al (2016) Overloaded and stressed: whole-cell
    considerations for bacterial synthetic biology. Curr Opin
    Microbiol 33:123-130
11. Ellis T (2019) Predicting how evolution will beat us. Microb
    Biotechnol 12(1):41
12. Gallup O, Ming H, Ellis T (2021) Ten future challenges for
    synthetic biology. Eng Biol 5(3):51-59
13. Clark M, Maselko M (2020) Transgene biocontainment strategies for
    molecular farming. Front Plant Sci 11:1-11
14. Maselko M et al (2017) Engineering species-like barriers to sexual
    reproduction. Nat Commun 8(1):883
15. Ryu M-H et al (2020) Control of nitrogen fixation in bacteria that
    associate with cereals. Nat Microbiol 5(2):314-330
16. Rylott EL, Bruce NC (2020) How synthetic biology can help
    bioremediation. Curr Opin Chem Biol 58:86-95
17. Voigt CA (2020) Synthetic biology 2020-2030: six
    commercially-available products that are changing our world.
    Nat Commun 11(1):6379
18. Jaschke PR et al (2019) Definitive demonstration by synthesis of
    genome annotation completeness. Proc Natl Acad Sci
    116(48):24206-24213
19. Wright BW et al (2020) Genome modularization reveals overlapped
    gene topology is necessary for efficient viral reproduction.
    ACS Synth Biol 9(11):3079-3090
20. Blazejewski T, Ho H-I, Wang HH (2019) Synthetic sequence
    entanglement augments stability and containment of genetic
    information in cells. Science 365(6453):595-598
21. Chowdhury B, Garai G (2017) A review on multiple sequence
    alignment from the perspective of genetic algorithm. Genomics
    109(5):419-431
22. Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the
    sensitivity of progressive multiple sequence alignment through
    sequence weighting, position-specific gap penalties and weight
    matrix choice. Nucleic Acids Res 22(22):4673-4680
23. Deorowicz S, Debudaj-Grabysz A, Gudys A (2016) FAMSA: fast and
    accurate multiple sequence alignment of huge protein families.
    Sci Rep 6(1):33964
24. Katoh K et al (2002) MAFFT: a novel method for rapid multiple
    sequence alignment based on fast Fourier transform. Nucleic Acids
    Res 30(14):3059-3066
25. Eddy SR (2004) What is a hidden Markov model? Nat Biotechnol
    22(10):1315-1316
26. Thomas J, Ramakrishnan N, Bailey-Kellogg C (2008) Graphical models
    of residue coupling in protein families. IEEE/ACM Trans Comput
    Biol Bioinform 5(2):183-197
27. Steinegger M et al (2019) HH-suite3 for fast remote homology
    detection and deep protein annotation. BMC Bioinformatics
    20(1):473
28. Seemayer S, Gruber M, Söding J (2014) CCMpred: fast and precise
    prediction of protein residue-residue contacts from correlated
    mutations. Bioinformatics 30(21):3128-3130
29. Finn RD, Clements J, Eddy SR (2011) HMMER web server: interactive
    sequence similarity searching. Nucleic Acids Res 39(Web Server
    issue):W29-W37
30. Finn RD et al (2014) Pfam: the protein families database. Nucleic
    Acids Res 42(D1):D222-D230
31. Blum M et al (2021) The InterPro protein families and domains
    database: 20 years on. Nucleic Acids Res 49(D1):D344-D354
32. Bhagwat M, Aravind L (2007) PSI-BLAST tutorial. In: Bergman NH
    (ed) Comparative genomics. Humana Press, Totowa
33. Jehl P, Sievers F, Higgins DG (2015) OD-seq: outlier detection in
    multiple sequence alignments. BMC Bioinformatics 16(1):1-11
34. Bailey TL et al (2015) The MEME suite. Nucleic Acids Res
    43(W1):W39-W49
35. Continuum A (2015) Anaconda software distribution. Computer
    software. Vers. 2-2.4.0
36. Bezanson J et al (2012) Julia: a fast dynamic language for
    technical computing. arXiv preprint arXiv:1209.5145
37. Jumper J et al (2021) Highly accurate protein structure prediction
    with AlphaFold. Nature 596(7873):583-589
38. Varadi M et al (2022) AlphaFold Protein Structure Database:
    massively expanding the structural coverage of protein-sequence
    space with high-accuracy models. Nucleic Acids Res
    50(D1):D439-D444
39. Sette M et al (1997) The structure of the translational initiation
    factor IF1 from E. coli contains an oligomer-binding motif.
    EMBO J 16(6):1436-1443
40. Reis AC, Salis HM (2020) An automated model test system for
    systematic development and improvement of gene expression models.
    ACS Synth Biol 9(11):3145-3156
41. Keseler IM et al (2017) The EcoCyc database: reflecting new
    knowledge about Escherichia coli K-12. Nucleic Acids Res
    45(D1):D543-D550




Design of Gene Boolean Gates and Circuits with Convergent 
Promoters 


Biruck Woldai Abraha and Mario Andrea Marchisio 


Abstract 


Gene digital circuits are the subject of many research works due to their various potential applications, from
hazard detection to medical diagnostics. Moreover, a remarkable number of techniques developed in
electronics can be used for the construction of biological digital systems. In our previous works, we
showed how to automate the design and modeling of gene digital circuits whose gates were based on
transcription and translation regulation. In this chapter, we illustrate how Boolean gates can be
implemented by following a particular architecture, that of convergent promoters, which is rather widespread
in nature but seldom adopted in Synthetic Biology. Besides gate design, we also explain how to extend our
previous modeling approach, based on composable parts and pools of molecules, to quantitatively describe and
simulate this particular kind of digital biological device.


Key words Boolean gates, Digital circuits, Convergent promoters, RNA polymerase II collision 


1 Introduction


Boolean gates are commonly used in electronics, where they
represent the basic components of the digital circuits on which
computers are built. Logic formulae are made of
variables, referred to as literals, organized into clauses. A logic
formula corresponds to a digital circuit where the literals are the
circuit inputs and the clauses are Boolean gates. Literals are binary
variables, i.e., they can assume only two values: 0 (FALSE) and
1 (TRUE). The NOT (¬) operator switches the value of a
literal from 0 to 1 or vice versa. Clauses are connected via other
logic operators: AND (∧) and OR (∨). The output of a logic
formula/circuit is a binary variable as well.

Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology,
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_7,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


In logic terms, a truth table is an object that explains, in a
complete and concise way, the relationship between the inputs
and the output of a circuit. Any truth table can be converted into
two equivalent logic formulae that, however, lead to slightly
different circuit schemes. One formula is written in the disjunctive
normal form (DNF) or sum of products (SOP). Here, every clause
contains logic multiplication among literals, and the output of the
formula corresponds to the sum of the outputs of all the clauses.
The other way of expressing a formula is the conjunctive
normal form (CNF), where clauses contain sums of literals and are,
in contrast, multiplied by each other (product of sums, POS). In
principle, one initially defines, through the truth table, the function
of a digital circuit. The truth table is then converted into both SOP
and POS formulae via, for instance, the Karnaugh map method [1],
and then, the formula that minimizes the number of gates is
implemented in the lab.
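The conversion of a truth table into its two equivalent formulae can be illustrated with a short Python sketch. It builds the canonical (unminimized) SOP and POS expressions directly from the rows of the table; it does not perform the Karnaugh map minimization mentioned above, and the function name and the textual operators (~ for NOT, & for AND, | for OR) are choices made here for illustration only.

```python
from itertools import product
from typing import Dict, Sequence, Tuple


def canonical_forms(names: Sequence[str],
                    truth: Dict[Tuple[int, ...], int]) -> Tuple[str, str]:
    """Return (SOP, POS) strings for a complete truth table.

    `truth` maps every tuple of 0/1 input values to a 0/1 output.
    """
    sop_clauses = []
    pos_clauses = []
    for values in product((0, 1), repeat=len(names)):
        if truth[values] == 1:
            # SOP: one AND-clause per row whose output is 1.
            literals = [n if v else "~" + n for n, v in zip(names, values)]
            sop_clauses.append("(" + " & ".join(literals) + ")")
        else:
            # POS: one OR-clause per row whose output is 0,
            # with every literal negated relative to the row.
            literals = ["~" + n if v else n for n, v in zip(names, values)]
            pos_clauses.append("(" + " | ".join(literals) + ")")
    sop = " | ".join(sop_clauses) if sop_clauses else "0"
    pos = " & ".join(pos_clauses) if pos_clauses else "1"
    return sop, pos
```

For a two-input XOR table, for example, the SOP formula has two AND-clauses (one per row with output 1) and the POS formula has two OR-clauses (one per row with output 0); which of the two leads to fewer gates depends on the table and on the subsequent minimization.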

In previous works from our lab, we have shown that as long as
the number of input signals is lower than or equal to 4, the
Karnaugh map method can be exploited for the automatic design of
digital synthetic gene networks [2, 3]. With respect to the circuit
schemes derived from POS formulae, those following SOP formulae
reduce the number of gates in the circuit by one unit by
requiring that each clause produces the circuit output, for instance,
a fluorescent protein. This is the so-called distributed output
architecture [4]. Although important, the number of transcription
units is not the only parameter to take into account when selecting
a scheme for a digital circuit. Boolean gates are transcriptional
units where the logic behavior depends on control of transcription
and/or translation (see Fig. 1 for the symbols used throughout this
chapter). In our previous works, this control could be achieved
mainly in three ways: at the promoter level via transcription factor
proteins (TFs) and at the mRNA level via either small RNAs or
riboswitches. Since small RNAs are much easier to engineer
than new proteins and riboswitches, moreover, interact
directly with chemicals, we proposed a complexity score to evaluate
the actual difficulty of implementing a circuit in the lab based on the
number of TFs and small RNAs rather than genes in the circuit
[2]. A slightly different criterion was described in the later work
by Gander et al. [5], where, moreover, CRISPR-dSpCas9 [6] was
employed and shown to be a powerful instrument to simplify
the structure of logic networks.


[Figure not reproduced. Legend entries: promoter, coding region (CDS),
reporter protein gene, incomplete reporter protein gene, terminator,
siRNA, dsRNA, mRNA; columns: sense, antisense]

Fig. 1 Symbols used in this chapter for the design of genetic Boolean gates and circuits


In Synthetic Biology, digital circuits are widely studied and have
been engineered in different organisms because of their vast
number of applications, from biocomputing [7, 8] to biosensors
[9, 10]. They can be used for medical diagnostics [11] or even as
therapeutic devices [12].

In this chapter, we describe how to use convergent
promoters to build synthetic gene Boolean gates and basic
digital circuits. First, we illustrate the molecular biology
of this particular promoter configuration and how the RNA
polymerase II collision, induced by this promoter architecture, can
be exploited to mimic logic functions. Then, we describe how to
design Boolean gates taking up to three inputs. Finally, we
show how molecular phenomena due to convergent promoters
have been modeled in the past and propose our mathematical
description of RNA polymerase II collision within the framework
of composable parts (see Note 1) that we developed for the modular
design of synthetic gene circuits [13].


2 Methods


2.1 Convergent Promoters


Convergent promoters are one way in which transcriptional
interference takes place in cells [14]. As the name says, in this
configuration two promoters face each other on the DNA and share
part of their transcripts (see Fig. 2a). The different strength of
the promoters determines which gene is transcribed in higher
quantity. Besides convergent promoters, other promoter arrangements,
such as tandem and overlapping promoters, can lead to
transcriptional interference (Fig. 2b).

Different mechanisms can cause transcriptional interference.
Promoter competition occurs when RNA polymerase, by binding
a promoter, hinders the binding of other RNA polymerase
molecules to a second promoter nearby (this can be the case of a
tandem or an overlapping promoter, as in Fig. 2b). Occlusion is due
to the transient, though frequent, occupancy of a promoter by
RNA polymerases coming from a close, strong promoter. Sitting
duck refers to the situation in which RNA polymerase molecules
take so long to initiate transcription that they are dislodged
from the DNA by other RNA polymerases coming, once again,
from a nearby "aggressive" promoter. Roadblock can be seen,
somehow, as the opposite phenomenon to sitting duck. In this case,
RNA polymerases are so tightly bound to the open complex formed
during initiation that RNA polymerases coming from an
opposite promoter are pushed off the DNA. It should be noted,
though, that roadblock can also take place far from a promoter if
an RNA polymerase molecule is stalling on the DNA. Finally,
collision is literally a clash between RNA polymerases elongating
in opposite directions along the DNA. As a result, either just one or
both RNA polymerases fall off the DNA and terminate
transcription (see Fig. 3). Theoretical studies suggest that the rate
of collision increases with the distance between and the activity of
the convergent promoters [15]. In this chapter, we will consider
only RNA polymerase collision as a phenomenon through which
convergent promoters make it possible to mimic logic formulae.


[Figure not reproduced. Panel (A): convergent promoters sharing a
common transcript. Panel (B): other forms of transcriptional
interference (tandem promoters, overlapping promoters)]

Fig. 2 Transcriptional interference. Among the main mechanisms that cause transcriptional interference, there
are (a) convergent promoters and (b) tandem or overlapping promoters. The net result of transcriptional
interference is a reduction in gene expression. In the figure, some parts are enclosed into blue frames to
distinguish them from the other parts that are in the opposite orientation

Convergent promoters have already been used in Synthetic
Biology in a different context, namely, to reconstruct RNA
interference (RNAi) in S. cerevisiae. They appeared to be an efficient
way to produce the siRNA precursor that is later processed by the
Dicer (first) and the Argonaute (later) into siRNAs (small interfering
RNAs). Upon binding the Argonaute, siRNAs give rise to the
RISC (RNA-induced silencing complex). siRNAs are designed to
bind, by base complementarity, the mRNA of a target gene that is
then cut by the Argonaute and finally degraded by the cell
machinery. It should be noted that both the Dicer and the Argonaute
genes are missing from the S. cerevisiae genome and have to be
reinserted into its chromosomes or expressed upon transformation
with either centromeric or episomal plasmids [16-18] (see Fig. 4).


[Figure not reproduced. Panels: occlusion, sitting duck, roadblock,
collision. Symbols: general DNA sequence, RNAp from the sense
promoter, RNAp from the antisense promoter, RNAp falling off the DNA]

Fig. 3 Possible mechanisms that trigger transcriptional interference


2.2 Boolean Gates Based on Convergent Promoters


In order to describe the design of synthetic biological Boolean
circuits, we are going to use the same formalism as in our previous
works on this topic [2, 3]. Circuit inputs are chemicals that can be
divided into two categories: inducers ($i$) and corepressors ($c$)
[19]. The former activate transcription, the latter inhibit it. The
circuits we depict in this chapter are transcriptional networks.
Hence, input chemicals act on transcription factor proteins. These
are either repressors ($R$) or activators ($A$) that can be in two
different states: active ($R^a$, $A^a$), i.e., able to bind the DNA, and
inactive ($R^i$, $A^i$), i.e., incapable of adhering to the double strand.
Thus, inducers interact with $R^a$ (turning them into inactive ones,
$R^i$) and $A^i$ (changing their configuration into the active one, $A^a$)
since transcription can take place when repressors are detached
from the DNA or activators are anchored to it. Following the
same logic, corepressors dock either to $R^i$ (making them active,
$R^a$) or $A^a$ (turning them into their inactive form, $A^i$). Therefore,
corepressors favor the binding of repressors and hinder the arrival
of activators at the target promoters. Overall, the admitted
reactions among the circuit inputs and transcription factors are

\begin{equation}
\begin{aligned}
i + R^a &\rightleftharpoons R^i \\
i + A^i &\rightleftharpoons A^a \\
c + R^i &\rightleftharpoons R^a \\
c + A^a &\rightleftharpoons A^i
\end{aligned}
\tag{1}
\end{equation}

Basic logic operations, i.e., YES (buffer gate in electronics) and
NOT, are realized by joining two transcription units, one of which
can be based on convergent promoters. They make use of the
reactions in Eq. 1 and are illustrated in Figs. 5 and 6.


[Figure not reproduced. RNAi based on convergent promoters: four
transcription units (convergent promoters, Dicer-expressing TU,
Argonaute-expressing TU, fluorescence-expressing TU) and the steps
transcription, processing of the siRNA precursor into small dsRNAs,
binding and strand removal, RISC formation, and mRNA cleavage and
degradation]

Fig. 4 Convergent promoters and RNA interference reconstruction in S. cerevisiae. The RNAi circuit in the figure is
organized into four transcription units (TUs), one consisting of convergent promoters flanking a fragment of a
reporter (fluorescent) protein. Convergent promoters lead to the synthesis of a long siRNA precursor that is
processed, by the Dicer, into small double-stranded RNAs. They are loaded into the Argonaute where one
strand is removed and the other permits the formation of the RISC: RNA-induced silencing complex. The siRNA
binds, by base pair complementarity, the mRNA of the fluorescence protein and puts the Argonaute in the
condition to cut the mRNA. Upon cleavage, the mRNA degradation pathway is activated. If the convergent
promoters can be induced via a chemical, cell fluorescence could be regulated through the reconstructed RNAi
pathways. Notice that, even though we are considering S. cerevisiae cells, compartments have been omitted
from the figure


[Figure not reproduced: NOT gate schemes with an inducer acting on an
antisense promoter through a repressor or an activator]

Fig. 5 NOT gates. Inducers make it possible to realize the NOT logic operation when acting on an antisense promoter.
Like in every other figure throughout the chapter, green arrows represent transcription activation, whereas
hammer-like red arrows stand for repression of transcription

An inducer, as an input, demands convergent promoters to
realize NOT gates. As shown in Fig. 5, we suppose that the two
promoters share the same CDS. The “sense” promoter drives the
correct mRNA transcription and leads to the synthesis of a functional
protein. The “antisense” promoter, in contrast, would not
produce anything useful to the cell. Here, the antisense promoter is
inducible. Hence, in the absence of the inducer, it is either switched
off by the active repressor R^a (upper right panel) or simply incapable
of synthesizing mRNA because it is not bound by the
corresponding activator, whose wild-type state is inactive (A^i,
lower right panel). As soon as the antisense promoter gets activated,
RNA polymerase II binds to it and starts elongating along the


128 


Biruck Woldai Abraha and Mario Andrea Marchisio 


Fig. 6 YES gates. Corepressor molecules realize a “buffer gate” when directed to an antisense promoter


coding region. Hence, a collision with RNA polymerase II molecules
coming from the sense promoter (constitutively active) takes
place and decreases protein synthesis. A NOT logic behavior can be
mimicked properly by tuning the strength of the antisense promoter,
i.e., by maximizing the collision rate between RNA
polymerase II molecules.

By using a corepressor molecule, in contrast, a convergent
promoter is required to reproduce the YES behavior (Fig. 6). The
antisense promoter is either activated by A^a (upper left panel) or
repressed by R^i (a repressor that is inactive in its ground state, lower
left panel). Hence, in the absence of c, transcription takes place
from the antisense promoter, and the output (a protein) is not
produced due to RNA polymerase collision, provided that the
antisense promoter is strong enough. If the corepressor c is added
to the system, either the activator becomes inactive or the repressor
becomes active. In both cases, the antisense promoter transcription rate
should be reduced to the leakage level with a consequent increase in
protein production.

These are the ways convergent promoters can be employed to
represent a literal and its negation. Figures 5 and 6 also show the
alternative design based on “standard” transcription units containing
a single sense promoter. By combining these functionalities
properly, we will illustrate how to construct, with convergent promoters,
basic Boolean gates accepting more than a single input.


2.2.1 Two-Input Boolean Gates


As we have seen in the previous section, the architecture of a YES or
a NOT gate depends only on the kind of the input chemical and the
promoter (sense or antisense) on which the chemical acts. The
nature of the transcription factor associated with the chemical does
not play any particular role. Moreover, a convergent promoter is a
multiplicative architecture that produces an output equal to 1 only
when the sense promoter is activated and the antisense promoter is
inhibited. In all the other possible cases, the result is equal to zero.
In other words, convergent promoters permit the design of multiplicative
gates.

If we want to realize the AND gate, a ∧ b, we need a
positive literal on both promoters. Hence, according to Table 1, we
need an inducer targeting the sense promoter and a corepressor targeting the
antisense one. Each chemical can be associated with two different
transcription factors. Thus, we have overall four possible schemes
for an AND gate (see Fig. 7).

Similarly, we can design four circuits representing different
configurations of the N-IMPLY gates (both ¬a ∧ b and a ∧ ¬b) and
the NOR gate (¬a ∧ ¬b = ¬(a ∨ b); the latter equality follows from
one of De Morgan’s laws). In Fig. 8, an implementation
of these two kinds of gates is sketched. Interestingly, an
N-IMPLY gate requires two molecules of the same type,
i.e., either two inducers or two corepressors. It should also be
noted that the NOR gate (together with the NAND one) is of
particular importance since it is a universal gate, i.e., it allows the
construction of every possible digital circuit, no matter its
complexity.
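The assignment rules above can be checked with a few lines of Python. This is only an idealized sketch: it assumes that a promoter is either fully on or fully off and that collision silences the output completely. The function names (`promoter_on`, `convergent_gate`, `truth_table`) are illustrative and do not come from the chapter.

```python
from itertools import product

def promoter_on(kind: str, level: int) -> bool:
    """Idealized promoter state: an inducer switches the promoter on,
    a corepressor switches it off, whichever transcription factor is used."""
    if kind == "inducer":
        return level == 1
    if kind == "corepressor":
        return level == 0
    raise ValueError(f"unknown chemical kind: {kind}")

def convergent_gate(sense_kind: str, antisense_kind: str, a: int, b: int) -> int:
    """Output is 1 only if the sense promoter fires and the antisense one is silent."""
    sense = promoter_on(sense_kind, a)
    antisense = promoter_on(antisense_kind, b)
    return int(sense and not antisense)

def truth_table(sense_kind: str, antisense_kind: str) -> list:
    """Outputs for (a, b) = (0,0), (0,1), (1,0), (1,1)."""
    return [convergent_gate(sense_kind, antisense_kind, a, b)
            for a, b in product((0, 1), repeat=2)]

print("AND      (a AND b):        ", truth_table("inducer", "corepressor"))
print("N-IMPLY  (a AND NOT b):    ", truth_table("inducer", "inducer"))
print("N-IMPLY  (NOT a AND b):    ", truth_table("corepressor", "corepressor"))
print("NOR      (NOT a AND NOT b):", truth_table("corepressor", "inducer"))
```

In this sketch the first input always acts on the sense promoter and the second on the antisense promoter, so the two N-IMPLY variants arise from two inducers and two corepressors, respectively.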

Other two-input gates correspond to more complex formulae.
In this case, we can still make use of the convergent promoters by
using the so-called distributed output architecture that, as mentioned
above, is a realization of the DNF (disjunctive normal form)
or SOP (sum of products) circuit representation. An XOR gate, for
instance, returns 0 when both inputs are identical and 1 when they
are different. It can be expressed as (a ∧ ¬b) ∨ (¬a ∧ b). The OR
operation between the two N-IMPLY clauses is obtained by requiring
that each multiplicative gate produces the same molecule, i.e., a
fluorescent protein, the circuit output (see Fig. 9).


Table 1
Assigning input chemicals, and the corresponding transcription factors, to the sense and antisense
promoter in order to have positive (YES) or negative (NOT) literals

System      YES          NOT
i + R^a     Sense        Antisense
i + A^i     Sense        Antisense
c + R^i     Antisense    Sense
c + A^a     Antisense    Sense


[Figure 7: four circuit schemes and the AND-gate truth table with columns i, c, pol >, < pol, out]

Fig. 7 Four possible implementations of an AND gate based on convergent promoters. The columns
named “pol >” and “< pol” refer to RNA polymerase II elongating from the sense and the antisense promoter,
respectively. The symbols “>” and “<” indicate the direction in which RNA polymerase II flows. “X,” in
contrast, means that RNA polymerase II cannot start elongation from the corresponding promoter


A NAND gate can also be designed by means of the distributed
output architecture since the formula ¬(a ∧ b) becomes, via
De Morgan’s law, ¬a ∨ ¬b. However, since we have two negated literals,
if we want to make use of convergent promoters only, both a and
b must be inducers, as shown in Fig. 10.
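The distributed output architecture can be sketched in the same idealized way: each convergent-promoter device computes one clause, and the OR arises because all devices produce the same output protein. The sketch below assumes all-or-nothing promoters and fully effective collision; for XOR it assumes that both chemicals are inducers, and for NAND it assumes a constitutive sense promoter in each device. The function names are illustrative only.

```python
from itertools import product

def clause(sense_on: bool, antisense_on: bool) -> bool:
    """One convergent-promoter device: output only if the sense promoter
    fires and the antisense promoter is silent."""
    return sense_on and not antisense_on

def xor_gate(a: int, b: int) -> int:
    # Clause A: a AND NOT b; clause B: NOT a AND b (two N-IMPLY devices).
    clause_a = clause(sense_on=bool(a), antisense_on=bool(b))
    clause_b = clause(sense_on=bool(b), antisense_on=bool(a))
    # Both devices express the same reporter, which realizes the OR.
    return int(clause_a or clause_b)

def nand_gate(a: int, b: int) -> int:
    # Each device has a constitutive sense promoter (assumption) and an
    # inducer-driven antisense promoter, i.e., it computes NOT a or NOT b.
    not_a = clause(sense_on=True, antisense_on=bool(a))
    not_b = clause(sense_on=True, antisense_on=bool(b))
    return int(not_a or not_b)

inputs = list(product((0, 1), repeat=2))
print("XOR :", [xor_gate(a, b) for a, b in inputs])
print("NAND:", [nand_gate(a, b) for a, b in inputs])
```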

Potentially, by assigning a single literal (input) to each promoter
and making use, when necessary, of the distributed output
architecture, every two-input Boolean gate can be designed via
convergent promoters. However, not every configuration is possible.
As we have seen, N-IMPLY gates demand inputs of the same
kind, whereas AND gates can be constructed only when one input
is an inducer and the other a corepressor. Obviously, more solutions
can be achieved by mixing gates based on convergent promoters
and traditional transcription units (promoter-CDS-terminator). In
particular, complex logic formulae would arise by combining NOR
gates as in Fig. 8 with classical inverters (NOT gates).




Fig. 8 Possible schemes of an N-IMPLY and a NOR gate 


2.2.2 Three-Input Boolean Gates


Two-input gates based on convergent promoters are designed by
taking into account only the kind of signals acting on the two
promoters. In contrast, “traditional” two-input Boolean gates,
containing the sense promoter only, are built by placing two operators,
i.e., the sequences where transcription factors bind, along the
(sense) promoter. Hence, the logic behavior strongly depends on
the interaction between the DNA and the transcription factors
[20]. For instance, two inducers i_1 and i_2 acting on two active
repressors, R_1^a and R_2^a, give rise to an AND gate. In contrast, if
they act on two inactive activators, A_1^i and A_2^i, they produce an OR
gate unless a strong cooperativity is required for the simultaneous
binding of the two proteins to the DNA: in this improbable case,
the logic function would be an AND gate [21].

If we want to use convergent promoters to build more complex
logic gates, accepting, for instance, three inputs, then the interactions
among the transcription factors binding the same promoter
cannot be neglected. As we have mentioned in the Introduction,
there are many different possible ways to realize Boolean gates.
Here, we consider only pure transcriptional gates, where a logic
function arises from the action of transcription factors on promoters.
Since each input chemical is associated with a different protein,
we can neglect any kind of hetero-cooperativity (which is quite
rare). Hence, two repressors switch off transcription independently
of each other. Similarly, two activators turn on protein synthesis
separately. Finally, if an activator and a repressor bind the same
promoter simultaneously, the repressor wins.


Fig. 9 An XOR gate based on the distributed output architecture between two transcription units containing
convergent promoters. The two clauses, and the relative convergent promoters, have been named A and B to
fully illustrate, in the truth table, the working of this rather complex design

Before starting the description of three-input Boolean gates, it
is worth noting that, inside a complex logic network, the inputs of a
gate are transcription factors that do not interact with any chemicals.
Previously, we have pointed out that a promoter regulated by
two active repressors, R_1^a and R_2^a, behaves as an AND gate if the
two repressors are inactivated by different chemicals, i_1 and i_2. In
contrast, if no input signal is able to inhibit R_1^a and R_2^a, then a
transcription unit with this configuration would become a NOR
gate, able to drive protein synthesis only in the absence of the two
inputs, the repressors in this case.
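The single-promoter rules stated above (repressors act independently, activators act separately, and the repressor wins) can be written as a small Boolean function. The sketch below is an idealization; in particular, treating a promoter without activator sites as constitutive is an assumption made here, and all names are illustrative.

```python
from itertools import product

def promoter_active(activators_bound=(), repressors_bound=()) -> bool:
    """Idealized transcription rule for one promoter.
    - Any bound repressor shuts the promoter off (the repressor wins).
    - A promoter with activator sites needs at least one bound activator.
    - A promoter without activator sites is treated as constitutive (assumption).
    """
    if any(repressors_bound):
        return False
    if len(activators_bound) > 0:
        return any(activators_bound)
    return True

def and_with_inducers(i1: int, i2: int) -> int:
    # R_1^a and R_2^a stay bound unless their inducer inactivates them.
    return int(promoter_active(repressors_bound=(not i1, not i2)))

def nor_with_repressors(r1: int, r2: int) -> int:
    # Inside a network the repressors themselves are the inputs.
    return int(promoter_active(repressors_bound=(bool(r1), bool(r2))))

inputs = list(product((0, 1), repeat=2))
print("inducers as inputs  :", [and_with_inducers(a, b) for a, b in inputs])
print("repressors as inputs:", [nor_with_repressors(a, b) for a, b in inputs])
```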




Fig. 10 A NAND gate made of two separate convergent-promoter devices. This design is possible only if the 
inputs are both inducers 


The key point in building three-input gates based on convergent
promoters is the multiplicative logic interaction (AND)
between the sense and antisense promoter. Let us suppose that we send
two inputs to the sense promoter and one to the antisense promoter.
If we want to implement a three-input AND gate, then we
have to combine a two-input AND gate on the sense promoter with
a positive literal (YES gate) on the antisense one. The AND gate on
the sense promoter can be realized in the way we just mentioned
above: two inducers inhibiting their corresponding repressors. As
for the YES gate on the antisense promoter, we cannot obtain it
with a third inducer molecule, but we need a corepressor acting, for
instance, on an inactive repressor protein. The overall scheme is
shown in Fig. 11.

Fig. 11 A three-input AND gate. The AND behavior between i_1 and i_2 is realized easily on the sense promoter
and then coupled with a corepressor, c_3, that carries out a YES function on the antisense promoter. Essential
for the behavior of the whole gate is the multiplicative relation between the two promoters


An alternative design would require an inducer acting on the
sense promoter and two corepressors binding active activators
controlling the antisense promoter (see Fig. 12). It should be
noted that, by substituting an active activator with an inactive
repressor, the overall three-input device would no longer behave
as an AND gate, as shown in Fig. 13. Other gates with a different
degree of complexity can be designed, potentially, in the same way
followed so far. Moreover, considering the synthetic gene digital
circuits in the literature, it is not advisable to regulate promoters
with more than two inputs.
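The three designs discussed above can be compared with a short, idealized Python sketch. It assumes all-or-nothing promoters, fully effective collision, and the promoter rule "the repressor wins; otherwise at least one bound activator is needed." In the second and third functions, the inducer on the sense promoter is assumed to act through an active repressor, and the choice of which activator is swapped for an inactive repressor is hypothetical; the resulting truth table is what these rules yield, not necessarily the formula reported in Fig. 13.

```python
from itertools import product

def promoter_active(activators=(), repressors=()) -> bool:
    """Repressor wins; with activator sites, at least one activator must be bound."""
    if any(repressors):
        return False
    return any(activators) if activators else True

def gate_two_on_sense(i1: int, i2: int, c3: int) -> int:
    """Design of Fig. 11: two inducers on the sense promoter, one corepressor on the antisense."""
    sense = promoter_active(repressors=(not i1, not i2))   # two-input AND
    antisense = promoter_active(repressors=(bool(c3),))    # c3 turns R_3^i into R_3^a
    return int(sense and not antisense)

def gate_two_on_antisense(i1: int, c2: int, c3: int) -> int:
    """Alternative design: one inducer on the sense promoter, two corepressors
    acting on active activators at the antisense promoter."""
    sense = promoter_active(repressors=(not i1,))
    antisense = promoter_active(activators=(not c2, not c3))
    return int(sense and not antisense)

def gate_activator_swapped(i1: int, c2: int, c3: int) -> int:
    """Same as above, but one active activator is replaced by an inactive repressor."""
    sense = promoter_active(repressors=(not i1,))
    antisense = promoter_active(activators=(not c2,), repressors=(bool(c3),))
    return int(sense and not antisense)

inputs = list(product((0, 1), repeat=3))
print("two inputs on sense    :", [gate_two_on_sense(*x) for x in inputs])
print("two inputs on antisense:", [gate_two_on_antisense(*x) for x in inputs])
print("activator swapped      :", [gate_activator_swapped(*x) for x in inputs])
```

Under these assumptions the first two designs return 1 only for the input (1, 1, 1), whereas the swapped variant reduces to i_1 ∧ (c_2 ∨ c_3).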

Fig. 12 Alternative design for a three-input AND gate. Differently from Fig. 11, the sense promoter hosts a YES
gate combined with a two-input AND gate on the antisense promoter. Also in this case, the three input signals
are not all of the same type. Moreover, it is essential that the corepressors act on active activators


In general, the convergent promoter architecture offers interesting
novel solutions for Boolean gate implementation. However,
as mentioned above, it might be too complex to build an entire
digital circuit with gates containing only convergent promoters.
Hybrid gates, mixing transcription and translation regulation,
might give rise to new interesting and useful architectures. Furthermore,
different gate configurations could be combined within cell
consortia [22] to make the implementation of intricate genetic
networks more feasible. In general, many schemes can be designed
to represent the same digital functions. However, methods for
accurate performance prediction and evaluation of construction
complexity are not fully established yet.


Fig. 13 Role of the transcription factors in determining a logic behavior. With respect to the AND gate in
Fig. 12, we exchange an active activator with an inactive repressor. This modification is enough to completely
spoil the AND gate by giving rise to a different logic formula


2.3 Modeling Transcription Interference


In the previous section, we have described possible rules for
designing gene Boolean gates based on convergent promoters.
Their working depends on a particular kind of transcriptional interference,
i.e., RNA polymerase (II) collision. Clearly, since we cared
about architectural aspects only, we assumed that RNA polymerase
II collision was always highly effective such that RNA polymerase II
molecules “fired” by the antisense promoter were able to stop
transcription completely after clashing with those coming from
the sense promoter. In reality, it is not clear how to predict the
result of such a collision. Indeed, both RNA polymerase II molecules
do not necessarily fall off the DNA, and, depending on several
factors such as the relative strength of the two promoters and the
distance between them, the probability that one of the two RNA
polymerases remains unaffected by the collision and goes on synthesizing
mRNA might even be rather high.


2.3.1 The Model by Sneppen and Co-authors


One of the first detailed mathematical models of transcriptional
interference in E. coli was presented by Sneppen et al. [15]. Even
though this framework relies on experimental results from bacteria
and, as we will see, it is probably not the most appropriate for a
theoretical representation of our Boolean gates, it is interesting to
see which kinds of parameters and “events” have been taken into
account to explain transcriptional interference.

Initially, the strengths of the sense and the antisense promoters
were evaluated separately. As for the antisense promoter (termed, in
the paper, the sensitive promoter but referred to here as p_as), RNA
polymerase (RNAp) is supposed to give rise to a sitting duck
complex (SDC, which is related to the low transcription initiation
rate from p_as) via an irreversible reaction with rate-constant k_on^as:

\[
\mathrm{RNAp} + p_{as} \xrightarrow{k_{on}^{as}} \mathrm{SDC} \tag{2}
\]


It should be noted that this is a different reaction from the
usual binding/unbinding of RNA polymerase to the promoter,
which is indeed a reversible reaction. The sitting duck complex is
then described to fire an “elongation complex” EC (see Note 2)
with firing rate-constant k_f^as:

\[
\mathrm{SDC} \xrightarrow{k_{f}^{as}} \mathrm{EC} \tag{3}
\]


The steady state of a single, isolated antisense promoter is characterized
by a balance between the formation of the sitting duck
and the elongation complex. This leads to the definition of a new
quantity, the promoter fraction occupancy

\[
\theta^{as} = \frac{k_{on}^{as}}{k_{on}^{as} + k_{f}^{as}} \tag{4}
\]

We can estimate the strength of the antisense promoter (K^as,
i.e., the equivalent of the transcription initiation rate) by multiplying
the firing rate-constant by the fraction occupancy we just
calculated. This turns out to be equal to

\[
K^{as} = k_{f}^{as}\,\theta^{as} = \frac{k_{f}^{as}\,k_{on}^{as}}{k_{on}^{as} + k_{f}^{as}} \tag{5}
\]


As for the sense, “aggressive,” promoter, p_sn, the transcription
initiation rate, K^sn, is not associated with any particular formula.
The average front-to-front time between two successive RNAps
fired by p_sn corresponds to the inverse of K^sn (1/K^sn). Sneppen
et al. pointed out that K^sn depends on what they called self-occlusion
time, which actually corresponds to the clearance time, i.e., the
time RNAp takes to leave the active site and move from the promoter
to the nearby DNA sequence (e.g., a ribosome binding site,
in bacteria). The clearance time (t_cl) is calculated as the ratio
between the length (l) and the speed (v) of the RNAp without
the sigma factor (i.e., the α2ββ′ complex). Therefore, we have that

\[
K^{sn} \le \frac{1}{t_{cl}} = \frac{v}{l} \tag{6}
\]

What the authors consider as the most important quantity to
characterize the sense promoter for studying transcriptional interference
is the “gap time” (t_g), i.e., the average time between the
back and the front of two consecutive RNAp molecules fired by p_sn
(in other words, the time taken by two successive front RNAps to
reach p_as). This corresponds to the time traveled by the first RNAp
(1/K^sn) minus the clearance time associated with the following
RNAp (l/v):

\[
t_{g} = \frac{1}{K^{sn}} - \frac{l}{v} = \frac{v - l\,K^{sn}}{v\,K^{sn}} \tag{7}
\]

A new rate-constant, termed K_g^sn, is defined as the inverse of t_g

\[
K_{g}^{sn} = \frac{1}{t_{g}} = \frac{K^{sn}}{1 - K^{sn}\,\frac{l}{v}} \tag{8}
\]

Using numerical values estimated from experiments on E. coli
cells (not relevant for the analysis in this chapter), it is possible to
show that, for weak promoters, it holds that K_g^sn ≈ K^sn.


2.3.2 Occlusion


In the model by Sneppen et al., three kinds of transcriptional
interference are taken into account (occlusion, sitting duck, and
collision) and appear to be strongly intertwined. Promoter occlusion
interferes with the formation of the SDC at the antisense
promoter. RNA polymerases fired by the sense promoter occlude
the antisense promoter during their elongation on the DNA.
Therefore, in order to have an SDC at p_as, the gap time t_g = 1/K_g^sn
should be long enough to guarantee RNAp binding at p_as. The
probability (χ) that p_as is free from ECs originating from p_sn is proportional
to the ratio between the gap time and the “total” time (t_t)
traveled by the RNAp fired by p_sn (t_t = 1/K^sn) multiplied by the
probability P_g that the actual gap time (i.e., the gap time to
the next arriving RNAp) is long enough to permit the binding of a
full RNAp (i.e., including the sigma factor).

P_g is quantified as


1 2 eu 
Ps == =p eS pe (9) 
Overall, we can write that

\[
\chi = \frac{t_{g}}{t_{t}}\,P_{g} = \left(1 - K^{sn}\,\frac{l}{v}\right) P_{g} \tag{10}
\]

As every probability, χ ∈ [0,1]. In particular, χ = 0 is verified
when K_g^sn approaches infinity, i.e., the gap time tends to zero and
the SDC cannot be formed (complete occlusion). In contrast, χ = 1
when K_g^sn = 0 such that the gap time approaches infinity and,
therefore, there is no occlusion.

Equation 10 leads to a redefinition of the rate-constant in Eq. 2
from k_on^as to χ k_on^as. Hence, transcriptional interference due to occlusion
changes the antisense promoter strength K^as in Eq. 5 into

\[
K_{OCC}^{as} = \frac{k_{f}^{as}\,\chi\,k_{on}^{as}}{k_{f}^{as} + \chi\,k_{on}^{as}} \tag{11}
\]

Finally, the transcriptional interference due to occlusion, I_OCC,
is quantified as the ratio between K^as and K_OCC^as

\[
I_{OCC} = \frac{K^{as}}{K_{OCC}^{as}} = \frac{k_{f}^{as} + \chi\,k_{on}^{as}}{\chi\left(k_{f}^{as} + k_{on}^{as}\right)} \tag{12}
\]

From Eq. 12, we can see that the minimal value of I_OCC is
1. This corresponds to the theoretical case of χ = 1, i.e., the absence
of occlusion.


2.3.3 Sitting Duck



Transcriptional interference takes place, mainly, at the antisense
promoter that is considered, in this framework, as much weaker
than the sense one. If, in the occlusion-based interference, RNA
polymerase coming from p_sn hinders the binding of RNAp at p_as
and prevents the formation of a sitting duck complex, in the sitting
duck interference we have that RNA polymerase fired by the sense
promoter removes an SDC that had been assembled at the antisense
promoter. Moreover, Sneppen et al. made the assumption that,
whenever such a clash takes place, the SDC is always destroyed.
Furthermore, they reckoned that the gap time, as estimated
above, is probably shorter if we consider that the formation of the
transcription initiation complex at p_sn and the subsequent firing of
an RNAp are much faster reactions than the SDC formation.
Therefore, they introduced the rate-constant K_{g'}^{sn} > K_g^sn as a correction
to the previous system description, which shall be included
in the definition of promoter fraction occupancy. Hence, Eq. 4 is
rewritten as

\[
\theta^{as} = \frac{\chi\,k_{on}^{as}}{\chi\,k_{on}^{as} + k_{f}^{as} + K_{g'}^{sn}} \tag{13}
\]

where also the probability χ has been considered. Under this
hypothesis, the strength of the antisense promoter becomes, due
to the presence of both occlusion and sitting duck interference,

\[
K_{OCC+SD}^{as} = \frac{k_{f}^{as}\,\chi\,k_{on}^{as}}{\chi\,k_{on}^{as} + k_{f}^{as} + K_{g'}^{sn}} \tag{14}
\]


The overall transcriptional interference due to occlusion and
sitting duck is then given by

\[
I_{OCC+SD} = \frac{K^{as}}{K_{OCC+SD}^{as}} = \frac{\chi\,k_{on}^{as} + k_{f}^{as} + K_{g'}^{sn}}{\chi\left(k_{on}^{as} + k_{f}^{as}\right)} \tag{15}
\]

If we set χ = 1, i.e., no occlusion, we can quantify the interference
due to the sitting duck mechanism alone

\[
I_{SD} = \frac{K^{as}}{K_{SD}^{as}} = 1 + \frac{K_{g'}^{sn}}{k_{on}^{as} + k_{f}^{as}} \tag{16}
\]

i.e., it grows linearly with K_{g'}^{sn}, the inverse of the “corrected”
gap time.


2.3.4 Collision



RNA polymerase (II) collision is the most important kind of transcriptional
interference in the design of our Boolean gates. At least
to understand how a Boolean gate would work, we supposed that,
in case of a clash, the two RNA polymerases would fall off the
DNA. Sneppen et al. gave an estimation of the probability P_c that
an RNAp fired by the antisense promoter reaches the sense promoter
by escaping a collision with another RNAp fired by the sense
promoter. They observed that P_c shall depend on three main factors:
(1) the time t_d = d/v (where d is the distance between the two
promoters and v the speed of RNAp) that the EC from p_as takes to
arrive at p_sn; (2) the fact that no RNAp shall be fired by p_sn during t_d;
and (3) the corrected gap time t_{g'}. Overall, P_c can be expressed as

\[
P_{c} = e^{-t_{d}/t_{g'}} = e^{-K_{g'}^{sn}\,t_{d}} \tag{17}
\]

Hence, a collision is avoided in just two hypothetical cases:
when either K_{g'}^{sn} = 0 (no RNAp is fired from the sense promoter)
or t_d = 0, i.e., the distance between the two promoters is null.

By taking into account the probability of escaping a collision,
the “total” strength of the antisense promoter K_T^as becomes

\[
K_{T}^{as} = K_{OCC+SD}^{as}\; e^{-K_{g'}^{sn}\,t_{d}} \tag{18}
\]

and the total transcriptional interference (I_T) in a convergent promoter
system is calculated as

\[
I_{T} = \frac{K^{as}}{K_{T}^{as}} = I_{OCC+SD}\; e^{K_{g'}^{sn}\,t_{d}} \tag{19}
\]


Sneppen et al. pointed out, however, that the equations derived
above get increasingly inaccurate if the strengths of the two promoters
are similar and the distance between the two promoters
becomes large. They underlined that collision gives a substantial
contribution to transcriptional interference when p_sn is very strong
and the distance from p_as is over 200 nt (this, at least, in E. coli).
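As a numerical illustration, the interference factors can be evaluated with a short script. This is only a sketch under stated assumptions: Eqs. 4, 5, 7, 8, 11, 12, 14, 15, and 17 to 19 are read as θ = k_on/(k_on + k_f), K_as = k_f·θ, t_g = 1/K_sn − l/v, K_g = 1/t_g, K_OCC = k_f·χ·k_on/(k_f + χ·k_on), K_OCC+SD = k_f·χ·k_on/(χ·k_on + k_f + K_g'), P_c = exp(−K_g'·t_d), and K_T = K_OCC+SD·P_c. The occlusion probability χ and the corrected gap rate K_g' (`K_gp` in the code) are supplied by the caller rather than computed, and all numerical values are hypothetical, not taken from Sneppen et al.

```python
import math

def isolated_strength(k_on: float, k_f: float) -> float:
    """Eqs. 4-5: occupancy of the isolated antisense promoter times the firing rate."""
    theta = k_on / (k_on + k_f)
    return k_f * theta

def gap_rate(K_sn: float, l: float, v: float) -> float:
    """Eqs. 7-8: inverse of the gap time t_g = 1/K_sn - l/v."""
    t_g = 1.0 / K_sn - l / v
    if t_g <= 0:
        raise ValueError("K_sn must stay below v/l (Eq. 6)")
    return 1.0 / t_g

def interference(k_on: float, k_f: float, chi: float, K_gp: float, t_d: float) -> dict:
    """Interference factors; chi and K_gp (corrected gap rate) are inputs."""
    K_as = isolated_strength(k_on, k_f)
    K_occ = k_f * chi * k_on / (k_f + chi * k_on)              # Eq. 11
    K_occ_sd = k_f * chi * k_on / (chi * k_on + k_f + K_gp)    # Eq. 14
    P_c = math.exp(-K_gp * t_d)                                # Eq. 17
    K_T = K_occ_sd * P_c                                       # Eq. 18
    return {
        "I_occ": K_as / K_occ,        # Eq. 12
        "I_occ_sd": K_as / K_occ_sd,  # Eq. 15
        "I_T": K_as / K_T,            # Eq. 19
    }

# Hypothetical numbers, chosen only to exercise the formulas (rates in 1/s, times in s).
result = interference(k_on=0.1, k_f=0.05, chi=0.9, K_gp=0.2, t_d=5.0)
for name, value in result.items():
    print(f"{name} = {value:.3f}")
```

With χ = 1, K_g' = 0, and any t_d, all three factors reduce to 1, which corresponds to the absence of interference.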


2.4 Modeling and Constructing Logic Gates via Transcriptional Interference


The modeling framework described above was extended by Bordoy
and Chatterjee [23], who combined antisense transcription with
antisense regulation, i.e., phenomena such as inhibition or attenuation
of translation and mRNA degradation. In a more recent work,
Bordoy et al. [24] applied transcriptional interference to the construction
of bacterial Boolean gates taking up to two inputs (AND
and OR). They focused on only two kinds of transcriptional interference,
i.e., those due to roadblock or tandem promoters. Moreover,
they analyzed their circuits via a different approach, with
respect to that in [23], based on the Shea-Ackers method [25],
already applied in the modeling of synthetic gene networks. Even
though their Boolean gates do not contemplate RNA polymerase
collision, it is interesting to see some conclusions that arose from
the experimental and theoretical results presented in this work.
An AND gate was designed by means of the roadblock mechanism
by placing a lac operator (lacOp) downstream of a
tetracycline-inducible promoter containing two tet operators
(tetOp, see Fig. 14). In principle, only in the presence of both
tetracycline (which docks to TetR and prevents it from binding any
of the two tetOps) and IPTG (that interacts with LacI and inhibits
its capability of binding lacOp), the synthesis of a fluorescent
protein takes place. In contrast, in the presence of tetracycline and
the absence of IPTG, RNAp can bind the promoter and start
transcription that is then aborted due to the clash with LacI
bound to the DNA. Interestingly, however, a proper AND gate
behavior depends on many parameters (e.g., the distance of lacOp
from tetOp, the dissociation rate of LacI) that need to be fine-tuned
in order to get the desired logic behavior, as previously
pointed out in [26]. To this aim, a mathematical model that allows
parameter optimization is needed. In order to simulate
the dynamics of this system, it is necessary to know the fraction of
the two free proteins, i.e., not bound to the corresponding chemicals
and, therefore, able to bind the DNA. Since TetR and LacI
behave in the same way with respect to their inducer molecules and
the DNA, let us consider a generic protein P that can lie in two
states: active (P^a), i.e., free from any inducers and, thus, able to bind
the DNA, and inactive (P^i), i.e., bound to n inducers (i) and no
longer able to anchor to the DNA. We suppose the system to be at steady
state such that the total concentration of P (P_T) does not change,
i.e., P_T = P^a + P^i.


Fig. 14 An AND gate based on roadblock. (a) Structure of the TU representing the AND gate. (b) Roadblock
process that takes place on the transcription unit in (a) in the presence of tetracycline and the absence of
IPTG. Here, RNA polymerase collides with LacI and falls off the promoter


Let us consider that there is complete cooperativity among the
inducers such that we can write

\[
n\,i + P^{a} \xrightarrow{k_{1}} P^{i} \tag{20}
\]

where k_1 is the association rate-constant between inducers and
proteins. The reaction is reversible; hence, we can also write that

\[
P^{i} \xrightarrow{k_{-1}} n\,i + P^{a} \tag{21}
\]

where k_{-1} is the dissociation rate-constant of the P^i complex.
The dynamics of P^a is obtained by solving the following ordinary
differential equation

\[
\frac{dP^{a}}{dt} = -k_{1}\,i^{n}\,P^{a} + k_{-1}\,P^{i} \tag{22}
\]

If we use the steady-state condition (which implies the conservation
of P_T), then we have that

\[
0 = -k_{1}\,i^{n}\,P^{a} + k_{-1}\left(P_{T} - P^{a}\right) \tag{23}
\]

After a few algebraic steps, Eq. 23 can be rewritten as

\[
f_{P}^{a} = \frac{P^{a}}{P_{T}} = \frac{1}{1 + \left(\frac{i}{K_{H}}\right)^{n}} \tag{24}
\]

which gives the fraction of the proteins P that are active (f_P^a), i.e.,
not associated with the chemical i and, therefore, able to bind the
DNA. Eq. 24 is a Hill function where K_H is the dissociation
constant between i and P at the equilibrium. In contrast, the
fraction of proteins that are bound to i (f_P^i) and cannot get access
to the DNA is given by

\[
f_{P}^{i} = 1 - f_{P}^{a} = \frac{\left(\frac{i}{K_{H}}\right)^{n}}{1 + \left(\frac{i}{K_{H}}\right)^{n}} \tag{25}
\]

Therefore, the status of protein P with respect to the inducer
i is described by the Hill functions in Eqs. 24 and 25. They hold for
both TetR (together with tetracycline) and LacI (IPTG).
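The Hill functions of Eqs. 24 and 25 are easy to evaluate numerically. The sketch below reads Eq. 24 as f_P^a = 1/(1 + (i/K_H)^n) and assumes, from the steady-state balance 0 = −k_1·i^n·P^a + k_{-1}·(P_T − P^a), that K_H^n = k_{-1}/k_1; this relation is our own algebra, and the function names and parameter values are hypothetical.

```python
def fraction_active(i: float, K_H: float, n: float) -> float:
    """Eq. 24: fraction of protein P that is free of inducer and can bind the DNA."""
    return 1.0 / (1.0 + (i / K_H) ** n)

def fraction_inactive(i: float, K_H: float, n: float) -> float:
    """Eq. 25: fraction of protein P bound to the inducer."""
    return 1.0 - fraction_active(i, K_H, n)

def steady_state_residual(i: float, k1: float, k_minus1: float, n: float,
                          P_T: float = 1.0) -> float:
    """Right-hand side of Eq. 23 evaluated at the Hill solution (should be ~0).
    Assumes K_H**n = k_minus1 / k1."""
    K_H = (k_minus1 / k1) ** (1.0 / n)
    P_a = P_T * fraction_active(i, K_H, n)
    return -k1 * i**n * P_a + k_minus1 * (P_T - P_a)

# Hypothetical values: inducer concentration in units of K_H, Hill coefficient n = 2.
for conc in (0.0, 0.5, 1.0, 2.0, 10.0):
    print(conc, round(fraction_active(conc, K_H=1.0, n=2), 4))
print("residual:", steady_state_residual(i=2.0, k1=3.0, k_minus1=1.5, n=2))
```

Half of the protein is inactive when the inducer concentration equals K_H, whatever the value of n.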

As mentioned above, Tet promoter and Lac operator occupancy
is calculated via the Shea-Ackers method. Promoter occupancy
(also referred to as the transfer function TF) is calculated as the
ratio between the states (binding events) that allow transcription
and all the possible states in which the promoter can lie. Hence, in
the Shea-Ackers approach, the transfer function is the probability of
having transcription from the promoter under examination.

In the AND gate by Bordoy et al., the transfer function of the
Tet promoter is given by a fraction where the numerator contains
the only binding event that gives rise to transcription, i.e.,
K_r · RNAp, where K_r is the association constant between RNA
polymerase and the promoter. The denominator of the transfer
function contains, in contrast, all the possible binding events, i.e.,
RNAp bound to the promoter (like in the numerator), TetR to a
single operator, and TetR to both operators. To quantify the TetR
binding events, the probability that TetR is free and active (as in
Eq. 24) shall be taken into account. If we use the notation K_t to
indicate the association constant between TetR and tetOp, the
transfer function for the Tet promoter (TF_TP) becomes

\[
TF_{TP} = \frac{K_{r}\,\mathrm{RNAp}}{1 + K_{r}\,\mathrm{RNAp} + 2\,K_{t}\,\mathrm{TetR}\,f_{TR} + \left(K_{t}\,\mathrm{TetR}\,f_{TR}\right)^{2}} \tag{26}
\]

where f_TR corresponds to f_P^a in Eq. 24.

The other component of the AND gate is the lac operator.
Here, the transfer function TF_LO is much easier to calculate since
no transcription can start from this location (i.e., the numerator is
equal to 1) and in the denominator we have only the case of LacI
bound to lacOp; therefore

\[
TF_{LO} = \frac{1}{1 + K_{l}\,\mathrm{LacI}\,f_{LI}} \tag{27}
\]

where K_l is the association constant between LacI and lacOp and
f_LI is a Hill function like in Eq. 24 in which the inducer is IPTG.
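The two transfer functions can be evaluated with a few lines of Python. This is a sketch under explicit assumptions: Eq. 26 is read as K_r·RNAp divided by 1 + K_r·RNAp + 2x + x², with x = K_t·TetR·f_TR, and Eq. 27 is read as 1/(1 + K_l·LacI·f_LI), i.e., with the empty-operator state included so that the result stays between 0 and 1. The symbols K_r, K_t, K_l and all parameter values are hypothetical and are not the fitted values of Bordoy et al.; the code does not attempt to reproduce their final output formula.

```python
def hill_active(i: float, K_H: float, n: float) -> float:
    """Fraction of repressor free of inducer (Hill function as in Eq. 24)."""
    return 1.0 / (1.0 + (i / K_H) ** n)

def tf_tet_promoter(rnap: float, tetR: float, tet: float,
                    K_r: float, K_t: float, K_H: float, n: float) -> float:
    """Shea-Ackers occupancy of the Tet promoter with two tet operators (Eq. 26)."""
    x = K_t * tetR * hill_active(tet, K_H, n)   # weight of one TetR-bound operator
    return K_r * rnap / (1.0 + K_r * rnap + 2.0 * x + x ** 2)

def tf_lac_operator(lacI: float, iptg: float,
                    K_l: float, K_H: float, n: float) -> float:
    """Probability that lacOp is free of LacI (Eq. 27, read with the empty state)."""
    return 1.0 / (1.0 + K_l * lacI * hill_active(iptg, K_H, n))

# Hypothetical, dimensionless parameters used only to show the trends.
params = dict(K_H=1.0, n=2)
for tet in (0.0, 1.0, 100.0):
    print("tet =", tet, "TF_TP =",
          round(tf_tet_promoter(1.0, 1.0, tet, K_r=1.0, K_t=10.0, **params), 4))
for iptg in (0.0, 1.0, 100.0):
    print("IPTG =", iptg, "TF_LO =",
          round(tf_lac_operator(1.0, iptg, K_l=10.0, **params), 4))
```

Both quantities increase with the concentration of the respective inducer, which is the qualitative behavior expected for the two components of the AND gate.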


Bordoy et al. [24] used the transfer functions in Eqs. 26 and 27
to calculate the amount of green fluorescent protein at fixed concentrations
of the two chemicals. The easiest way was to sum them
up and multiply them by a parameter accounting for translation.
However, this simple formulation did not work, and further corrective
terms had to be added to have a model that could fit the
data. As shown in the original work by Shea and Ackers [25], it is
possible to use all the probabilities that the approach requires
to write a system of ODEs, whose solution gives the
dynamics of the biological system under study.


2.5 RNA Polymerase II Collision and Composable Parts


The Boolean gates illustrated in Figs. 5, 6, 7, 8, 9, 10, 11, 12, and
13 are different from those implemented by Bordoy and Chatterjee
[23] since only one gene is expressed, under the sense promoter,
whereas the RNA polymerase II fired by the antisense promoter has
the task of opposing gene expression. Here, we sketch a model for gene
expression under RNA polymerase II collision on a convergent
promoter system based on the formalism of composable parts
[13]. RNA polymerase II plays a fundamental role in this modeling
framework since it is supposed to be stored in a pool connected to
every TU inside a circuit. Fluxes of RNA polymerase II are
exchanged between the RNApII pool and the circuit TUs. More
generally, genetic circuits are supposed to work thanks to the fluxes
of molecules referred to as common signal carriers [27]. RNApII
goes through any DNA part, synthesizes the pre-mRNA, and
goes back to its pool. By dealing with synthetic DNA sequences, we
can neglect the presence of introns and the action of the spliceosome
on the pre-mRNA [28]. A single reaction can lump all
the necessary steps for mRNA maturation and transport to the
cytoplasm where translation takes place. If the protein is a transcription
factor, it will then be imported back into the nucleus. A reporter
protein can be considered as stored in its own pool in the cytoplasm
(see Fig. 15).


2.5.1 RNApII Collision Without Transcription Regulation: A Simple Transcription Unit



As explained in the previous section, the main feature of the “composable part” approach is to treat transcription as the result of the RNApII flux through the DNA. This flux is not continuous because RNApII “jumps” from one complex to another, which may or may not lie in an adjacent part. We do not have to explicitly introduce an antisense promoter in our model but only a new species, the antisense RNApII (RNApIIa, fired by the antisense promoter), and a complex where RNApIIa and RNApII (from the sense promoter) meet and clash. This complex leads RNApII to drop off the DNA (see Fig. 16). As will become clear below, RNApIIa is, basically, a fake species in this model (see Note 3).

RNApII interacts with the sense promoter P, the only one explicitly present in the model, giving rise to the complex [RNApII-P]




Fig. 15 Transcription and translation in eukaryotic cells according to the “Parts & Pools” (or composable part) framework [30]. The production of a reporter protein demands a flux of RNApII from its pool to the promoter upstream of the fluorescent protein and, then, down to the whole DNA TU up to the terminator, where RNA polymerase II gets free and goes back to its pool (dashed arrows represent flux of molecules). Pools are clearly an abstraction for the place where certain molecules are stored in the cell. The same idea of fluxes applies to translation as well, where ribosomes flow through the mRNA and the reporter proteins, upon synthesis, move to their pool in the cytoplasm


$$\text{RNApII} + P \longrightarrow [\text{RNApII-P}]$$

$$[\text{RNApII-P}] \longrightarrow \text{RNApII} + P$$


Then RNApII has two possibilities: one is moving to the CDS, 
making a complex with it, and leaving P free, such that other 
molecules of RNApII can bind the DNA 


$$[\text{RNApII-P}] \longrightarrow P + [\text{RNApII-CDS}]$$


the other possible event is a clash with an RNApII molecule fired by the antisense promoter (RNApIIa). First, the two RNApIIs are supposed to form a complex and leave the promoter free (with rate-constant k_c, where c stands for clash). Then RNApII leaves the DNA (and goes back to its pool), whereas the antisense RNApIIa




Fig. 16 Graphical representation of the reactions leading to pre-mature mRNA (pm) synthesis and RNApII collision in the framework of composable parts (panels: visual representation; model reactions)


produces a non-sense RNA that we call nil. This explains why, as stated above, RNApIIa is not a real species in this model


$$[\text{RNApII-P}] \xrightarrow{k_c} P + [\text{RNApII-RNApIIa}]$$

$$[\text{RNApII-RNApIIa}] \longrightarrow \text{RNApII} + nil$$

RNApII bound to the CDS can reach the terminator

$$[\text{RNApII-CDS}] \longrightarrow [\text{RNApII-T}]$$


Here, transcription ends with RNApII leaving the DNA and 
releasing the pre-mature mRNA, pm 


$$[\text{RNApII-T}] \longrightarrow \text{RNApII} + pm$$


The pre-mRNA undergoes maturation and transport to the 
cytoplasm. The overall process is simplified in a single reaction 


$$pm \longrightarrow mRNA$$


2.5.2 Modeling a NOT Gate




In the cytoplasm, the ribosomes (r) bind the mRNA, forming a complex ([rm]), before starting the synthesis of the reporter protein (F)

$$r + mRNA \longrightarrow [rm]$$

$$[rm] \longrightarrow r + mRNA$$

$$[rm] \longrightarrow r + mRNA + F$$


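To complete the model, we shall take into account all the necessary degradation processes. We suppose that neither RNApII and the ribosomes nor the DNA (i.e., the promoter) decay. As for the other species, we have

$$pm \xrightarrow{k_{d,pm}} \emptyset$$

$$nil \xrightarrow{k_{d,nil}} \emptyset$$

$$mRNA \xrightarrow{k_{d,m}} \emptyset$$

$$[rm] \xrightarrow{k_{d,rm}} r$$

$$F \xrightarrow{k_{dF}} \emptyset$$
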

The meaning of the rate-constants in the reactions above, 
together with their numerical values and those of other parameters 
used in our simulations (e.g., species initial concentrations, com- 
partment volumes), is given in Table 2 (see Note 4). They are based 
on our previous work [20]. 
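
Under mass-action kinetics, the reactions of this section can be collected into a system of ODEs. The block below is only a sketch under stated assumptions, not the authors' COPASI model: all species are counted in molecules, volume conversions between nucleus and cytoplasm are ignored, a single decay constant k_d is used for every RNA species, and the symbols k_on, k_off, k_2, k_do, k_T, k_tr, k_mat, k_rb, k_ru, and k_tl are placeholder names introduced here for the Table 2 entries (promoter binding and unbinding, [RNApII-CDS] formation, drop off, [RNApII-T] formation, transcription, maturation, ribosome binding and unbinding, translation).

```latex
\begin{aligned}
\frac{d[\text{RNApII-P}]}{dt} &= k_{on}\,\text{RNApII}\cdot P - (k_{off} + k_2 + k_c)\,[\text{RNApII-P}]\\
\frac{d[\text{RNApII-RNApIIa}]}{dt} &= k_c\,[\text{RNApII-P}] - k_{do}\,[\text{RNApII-RNApIIa}]\\
\frac{d[\text{RNApII-CDS}]}{dt} &= k_2\,[\text{RNApII-P}] - k_T\,[\text{RNApII-CDS}]\\
\frac{d[\text{RNApII-T}]}{dt} &= k_T\,[\text{RNApII-CDS}] - k_{tr}\,[\text{RNApII-T}]\\
\frac{d\,pm}{dt} &= k_{tr}\,[\text{RNApII-T}] - (k_{mat} + k_d)\,pm\\
\frac{d\,nil}{dt} &= k_{do}\,[\text{RNApII-RNApIIa}] - k_d\,nil\\
\frac{d\,mRNA}{dt} &= k_{mat}\,pm - k_{rb}\,r\cdot mRNA + (k_{ru} + k_{tl})\,[rm] - k_d\,mRNA\\
\frac{d[rm]}{dt} &= k_{rb}\,r\cdot mRNA - (k_{ru} + k_{tl} + k_d)\,[rm]\\
\frac{dF}{dt} &= k_{tl}\,[rm] - k_{dF}\,F
\end{aligned}
```

The free promoter, free RNApII, and free ribosomes do not need their own equations in this sketch, since they follow from conservation (for instance, P equals the total promoter number minus [RNApII-P], and r equals the total ribosome number minus [rm]).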

Variations in the value of k_c mimic different transcription strengths of the antisense promoter, with their repercussions on gene expression. By running a “Parameter Scan” with COPASI [29], we can see that the number of reporter proteins decreases quadratically with increasing values of k_c (from 0 to 1 s⁻¹; see Fig. 17). In particular, k_c = 1 s⁻¹ determines a reduction of about 61% in fluorescence expression with respect to the absence
of RNApII collision. Therefore, according to this picture, an anti- 
sense promoter stronger than the sense one (kr = 0.65’) does not 
completely suppress the production of protein F. 
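
The effect of k_c on expression can be explored with a back-of-the-envelope calculation before running a full parameter scan. The sketch below is not the COPASI model: it assumes a single promoter that is either free or occupied, an RNApII pool that is never depleted, and downstream steps that scale linearly with the transcription flux. The function name and argument names are hypothetical, and the default values are our reading of Table 2 (RNApII-promoter binding and unbinding, [RNApII-CDS] complex formation, nuclear volume, RNApII number).

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def relative_reporter_level(k_c, n_rnap=1000, v_nucleus=2.9e-15,
                            k_bind=5.0e4, k_unbind=0.1, k_to_cds=0.5):
    """Sense-promoter transcription flux at clash rate k_c, relative to k_c = 0.

    Assumes one promoter in quasi-steady state: the occupied promoter is
    emptied by unbinding, by the move to the CDS, or by a clash.
    """
    rnap_molar = n_rnap / (AVOGADRO * v_nucleus)   # RNApII concentration (M)
    k_on = k_bind * rnap_molar                     # pseudo-first-order binding (1/s)
    base = k_on + k_unbind + k_to_cds              # promoter turnover without clash
    return base / (base + k_c)


if __name__ == "__main__":
    for k_c in (0.0, 0.25, 0.5, 1.0):
        level = relative_reporter_level(k_c)
        print(f"k_c = {k_c:4.2f} 1/s -> relative level {level:.3f}")
```

Under these assumptions, k_c = 1 s⁻¹ leaves roughly 39% of the unperturbed level, in line with the reduction of about 61% reported above.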


By following this approach, let us see how to model the NOT gate in Fig. 5, where the antisense promoter is repressed by an active repressor R^a that interacts with an inducer i. In the absence of i, the antisense promoter is OFF and, therefore, the reporter protein is produced in high quantity. R^a gets completely inactivated when




Table 2
Species, rate-constants, and other parameters used in the model of RNApII collision due to convergent promoters

Name                   Meaning                                Value                Unit
V_n                    Nuclear volume                         2.9E-15              L
V_c                    Cytoplasmic volume                     3.91E-14             L
P                      Promoter                               1                    Molecule
RNApII                 RNA polymerase II                      1000                 Molecule
r                      Ribosome                               3000                 Molecule
                       RNApII - promoter binding              50,000               M⁻¹ s⁻¹
                       RNApII - promoter unbinding            0.1                  s⁻¹
                       [RNApII-CDS] complex formation         0.5                  s⁻¹
k_c                    Clash rate-constant                    Arbitrary            s⁻¹
                       RNApII - DNA drop off rate-constant    0.5                  s⁻¹
                       [RNApII-T] complex formation           0.5                  s⁻¹
                       Transcription rate-constant            0.6                  s⁻¹
                       Maturation rate-constant               5.5E-4 (30 min)      s⁻¹
                       Ribosome - mRNA binding                35,000               M⁻¹ s⁻¹
                       Ribosome - mRNA unbinding              0.015                s⁻¹
                       Translation rate-constant              0.02                 s⁻¹
k_d (pm, nil, m, rm)   RNA decay rate-constant                5.7E-4 (20 min)      s⁻¹
k_dF                   Reporter protein decay rate-constant   8.25E-05 (140 min)   s⁻¹


exposed to an abundant concentration of i, such that the antisense promoter is accessed by RNApII and collision between RNA polymerase II molecules can take place on the DNA, with a considerable reduction in fluorescence expression. In our modeling approach, k_c can no longer be treated as a fixed number, as in the previous section, but shall depend on the degree of repression of the antisense promoter. We express k_c as a function of both R^a and R^i such as

$$k_c = k_s \frac{R^i}{1 + R^a}$$

where k_s is a constant that varies with the antisense promoter (but shall not be confused with a parameter that quantifies the promoter leakage) and takes into account the promoter strength and the affinity with R^a.


Fig. 17 Reporter protein number as a function of k_c. An increase in the clash rate-constant determines a quadratic decrease in the fluorescent protein number (x axis: k_c (1/s), from 0 to 1; y axis: Reporter F (molecules))


With respect to the model for the unregulated transcription unit, we shall also consider the production of R^a, its interaction with the inducer i, and the degradation of both R^a and R^i. The inducer concentration can be treated as a constant. Moreover, for the sake of simplicity, we do not model a circuit made of two transcription units, one encoding R^a and the other F, but describe R^a synthesis as a zeroth-order reaction

$$\emptyset \xrightarrow{k_{pRa}} R^a$$

whereas the interactions between repressors and inducers are given by

$$i + R^a \longrightarrow R^i$$

$$R^i \longrightarrow i + R^a$$

Inducer concentration is fixed at the beginning of a simulation, and the decay of the chemical is neglected. In contrast, the repressors are degraded

$$R^a \xrightarrow{k_{dRa}} \emptyset$$

$$R^i \xrightarrow{k_{dRi}} i$$

The rate-constants used in the above model are explained in Table 3, together with the numerical values used in our simulations.




Table 3
Parameters, and corresponding values, used to simulate a NOT gate based on RNApII collision

Name     Meaning                          Value             Unit
k_pRa    R^a production rate              3.6E-10           M/s
         R^a - inducer binding            1E+09             M⁻¹ s⁻¹
         R^a - inducer dissociation       1                 s⁻¹
k_s      Antisense promoter “strength”    0.3               s⁻¹
k_dRa    Active repressor degradation     2.7E-4 (40 min)   s⁻¹
k_dRi    Inactive repressor degradation   2.7E-4 (40 min)   s⁻¹
Fig. 18 NOT gate based on convergent promoter architecture. The number of reporter proteins is shown in logarithmic scale (x axis: Inducer concentration (M), from 1.00E-09 to 1.00E-03; y axis: Reporter proteins (molecules); data labels: 2983.21, 2982.15, 2972.52, 2872.39, 1237.75, 2.75, 2.69)


As shown in Fig. 18, the NOT behavior of the circuit is apparent. The logic “0” output is reached at an inducer concentration of 10 µM (input = 1), at which fewer than 4 molecules of F are present in the system, compared to the almost 3000 (output = 1) in the absence of i (input = 0).
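
The range spanned by the clash rate-constant in the NOT gate can be estimated from the Table 3 values. The sketch below is an illustration under explicit assumptions, not the simulated model: repressor amounts are counted in molecules inside the nucleus, the total repressor pool is taken at its steady state (production over degradation), and k_s is the name used here for the antisense promoter “strength”. The function names are hypothetical.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def clash_rate(r_active, r_inactive, k_s=0.3):
    """k_c = k_s * R^i / (1 + R^a), with repressors in molecules (assumption)."""
    return k_s * r_inactive / (1.0 + r_active)


def total_repressor(k_prod=3.6e-10, k_deg=2.7e-4, v_nucleus=2.9e-15):
    """Steady-state repressor pool (production / degradation), in molecules.

    Valid because active and inactive repressors share the same decay rate.
    """
    return k_prod / k_deg * AVOGADRO * v_nucleus


if __name__ == "__main__":
    r_tot = total_repressor()
    print(f"total repressor: about {r_tot:.0f} molecules")
    print(f"k_c without inducer (all active): {clash_rate(r_tot, 0.0):.3g} 1/s")
    print(f"k_c at saturating inducer (all inactive): {clash_rate(0.0, r_tot):.3g} 1/s")
```

Under this reading, k_c goes from 0 in the absence of the inducer to several hundred per second when the whole repressor pool is inactive, which is far above the values scanned for the unregulated transcription unit.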


2.5.3 Modeling a Two-Input NOR Gate

The NOT gate we have just analyzed can be easily turned into the NOR gate in Fig. 8 by substituting the constitutive sense promoter with an activated one. In its ground state, the sense promoter P is basically OFF (any leakage is here neglected) and is turned into a functional, active configuration, P*, upon binding the active activator A^a. A^a, however, is inactivated when the corepressor c is present in the system at a proper concentration.
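
At the Boolean level, the design described in the paragraph above can be summarized in a few lines. This is only a logical abstraction written for illustration (the function and variable names are ours): the sense promoter is treated as ON only when no corepressor is present, and the antisense promoter as ON only when the inducer is present.

```python
def nor_gate(inducer: bool, corepressor: bool) -> bool:
    """Boolean abstraction of the convergent-promoter NOR gate."""
    # The activator stays active, and the sense promoter ON, only without corepressor.
    sense_promoter_on = not corepressor
    # The inducer inactivates the repressor, switching the antisense promoter ON.
    antisense_promoter_on = inducer
    # The reporter is made only if the sense promoter fires and no collision occurs.
    return sense_promoter_on and not antisense_promoter_on


if __name__ == "__main__":
    print("inducer corepressor output")
    for inducer in (False, True):
        for corepressor in (False, True):
            print(int(inducer), int(corepressor), int(nor_gate(inducer, corepressor)))
```

The output is 1 only when both inputs are 0, which is the NOR truth table.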


With respect to the model for the NOT gate, nothing changes in the description of the flux of the antisense RNApII. In contrast, the “current” of sense RNApII is modulated by the corepressor molecules. Overall, a model for this two-input gate requires only a few more reactions, based on mass-action kinetics, i.e., the active activator synthesis

$$\emptyset \xrightarrow{k_{pAa}} A^a$$

the binding and unbinding of the activator to the DNA

$$A^a + P \longrightarrow P^{*}$$

$$P^{*} \longrightarrow A^a + P$$

the interaction between the activator and the corepressor (both when the activator is bound to the DNA and when it is free in the nucleus)

$$c + A^a \longrightarrow A^i$$

$$A^i \longrightarrow c + A^a$$

$$c + P^{*} \longrightarrow P + A^i$$

and the degradation reactions

$$A^a \xrightarrow{k_{dAa}} \emptyset$$

$$A^i \xrightarrow{k_{dAi}} c$$

$$P^{*} \xrightarrow{k_{dP*}} P$$

An overview of the parameters we used in this model is given in Table 4.

Figure 19 shows how the NOR gate is faithfully reproduced by our convergent promoter design. As shown in the truth table, the “1” logic value corresponds to a concentration of 10 µM, as in the NOT gate.


3 Conclusions


In this chapter, we have sketched how gene Boolean gates can be 
designed by means of a convergent promoter architecture, where 
we considered RNApII collision as the main mechanism responsi- 
ble for logic behavior. Potentially, complex logic circuits can be 
drawn in an automatic way by adapting the rules for gate design and 
composition, illustrated above, to the framework we developed in 
[2], where the only input the user has to supply is the circuit truth 
table. Moreover, we have also shown how our approach of model- 
ing genetic circuits by means of DNA composable parts and pools 




Table 4
Parameters, and corresponding values, added to the NOT gate model in order to simulate a NOR gate based on convergent promoters

Name     Meaning                                    Value             Unit
k_pAa    A^a production rate                        3.6E-10           M/s
         A^a - promoter binding                     1E+06             M⁻¹ s⁻¹
         A^a - promoter dissociation                0.1               s⁻¹
         A^a - corepressor association              1E+09             M⁻¹ s⁻¹
         A^a - corepressor dissociation             1                 s⁻¹
         A^a - corepressor binding on the DNA       1E+06             M⁻¹ s⁻¹
k_dAa    Active activator degradation               2.7E-4 (40 min)   s⁻¹
k_dAi    Inactive activator degradation             2.7E-4 (40 min)   s⁻¹
k_dP*    Active activator degradation on the DNA    2.7E-4 (40 min)   s⁻¹


Fig. 19 Performance of a NOR gate based on convergent promoters. The y axis, reporting the number of fluorescent proteins in the cell, is in logarithmic scale. The 1-to-0 ratio of the device is of about 877-fold (y axis: Reporter proteins (molecules); inputs: inducer/corepressor at 1.00E-05 M; bar labels: 2781.10, 3.17, 0.05, 5.98E-05)




of molecules can be adapted to the simulation of RNApII collision by, basically, considering a single parameter, which we termed k_c, to describe the action of the antisense promoters under regulation of repressors or activators. A proper estimation of k_c for promoter-transcription factor pairs might lead to predictive, uncomplicated models for even intricate synthetic gene networks.


4 Notes

1. The piece of software corresponding to this method has been named “Parts & Pools” [30].

2. EC corresponds to what we called Pof! in our method for the modular modeling of genetic circuits with composable parts [13].

3. The pre-mRNA produced by RNApIIa, here called nil, is not necessary and can be omitted from the model.

4. The initial amount of all the species that are not present in Table 2 is 0.


Acknowledgement

MAM wants to thank Joerg Stelling and Fabian Rudolf for valuable discussions, during his Post Doc time at the D-BSSE ETH Zurich, that led to the idea of using convergent promoters for the construction of gene digital circuits.


References

1. Karnaugh M (1953) The map method for synthesis of combinational logic circuits. Trans Am Inst Electr Eng 72(9):593-599

2. Marchisio MA, Stelling J (2011) Automatic design of digital synthetic gene circuits. PLoS Comput Biol 7(2):e1001083. https://doi.org/10.1371/journal.pcbi.1001083

3. Marchisio MA, Stelling J (2014) Simplified computational design of digital synthetic gene circuits. In: Kulkarni V, Raman K (eds) A systems theoretic approach to systems and synthetic biology II: analysis and design of cellular systems. Springer-Verlag, Dordrecht

4. Rinaudo K, Bleris L, Maddamsetti R, Subramanian S, Weiss R, Benenson Y (2007) A universal RNAi-based logic evaluator that operates in mammalian cells. Nat Biotechnol 25(7):795-801. https://doi.org/10.1038/nbt1307



5. Gander MW, Vrana JD, Voje WE, Carothers JM, Klavins E (2017) Digital logic circuits in yeast with CRISPR-dCas9 NOR gates. Nat Commun 8:15459. https://doi.org/10.1038/ncomms15459

6. Marchisio MA, Huang Z (2017) CRISPR-Cas type II-based synthetic biology applications in eukaryotic cells. RNA Biol 14(10):1286-1293. https://doi.org/10.1080/15476286.2017.1282024

7. Benenson Y (2012) Biomolecular computing systems: principles, progress and potential. Nat Rev Genet 13(7):455-468. https://doi.org/10.1038/nrg3197

8. Doshi J, Willis K, Madurga A, Stelzer C, Benenson Y (2020) Multiple alternative promoters and alternative splicing enable universal transcription-based logic computation in mammalian cells. Cell Rep 33(9):108437. https://doi.org/10.1016/j.celrep.2020.108437

9. Siegfried K, Endes C, Bhuiyan AF, Kuppardt A, Mattusch J, van der Meer JR, Chatzinotas A, Harms H (2012) Field testing of arsenic in groundwater samples of Bangladesh using a test kit based on lyophilized bioreporter bacteria. Environ Sci Technol 46(6):3281-3287. https://doi.org/10.1021/es203511k

10. Jaiswal S, Shukla P (2020) Alternative strategies for microbial remediation of pollutants via synthetic biology. Front Microbiol 11:808. https://doi.org/10.3389/fmicb.2020.00808

11. Liu Y, Zeng Y, Liu L, Zhuang C, Fu X, Huang W, Cai Z (2014) Synthesizing AND gate genetic circuits based on CRISPR-Cas9 for identification of bladder cancer cells. Nat Commun 5:5393. https://doi.org/10.1038/ncomms6393

12. Mingzhang Guo KH, Xu W (2020) Third generation whole-cell sensing systems: synthetic biology inside, nanomaterial outside. Trends Biotechnol 39(6):550-559

13. Marchisio MA, Stelling J (2008) Computational design of synthetic gene circuits with composable parts. Bioinformatics 24(17):1903-1910. https://doi.org/10.1093/bioinformatics/btn330

14. Shearwin KE, Callen BP, Egan JB (2005) Transcriptional interference: a crash course. Trends Genet 21(6):339-345. https://doi.org/10.1016/j.tig.2005.04.009

15. Sneppen K, Dodd IB, Shearwin KE, Palmer AC, Schubert RA, Callen BP, Egan JB (2005) A mathematical model for transcriptional interference by RNA polymerase traffic in Escherichia coli. J Mol Biol 346(2):399-409. https://doi.org/10.1016/j.jmb.2004.11.075

16. Drinnenberg IA, Weinberg DE, Xie KT, Mower JP, Wolfe KH, Fink GR, Bartel DP (2009) RNAi in budding yeast. Science 326(5952):544-550. https://doi.org/10.1126/science.1176945

17. Si T, Luo Y, Bao Z, Zhao H (2015) RNAi-assisted genome evolution in Saccharomyces cerevisiae for complex phenotype engineering. ACS Synth Biol 4(3):283-291. https://doi.org/10.1021/sb500074a

18. Crook N, Sun J, Morse N, Schmitz A, Alper HS (2016) Identification of gene knockdown targets conferring enhanced isobutanol and 1-butanol tolerance to Saccharomyces cerevisiae using a tunable RNAi screening approach. Appl Microbiol Biotechnol 100(23):10005-10018. https://doi.org/10.1007/s00253-016-7791-2



19. Lewin B (2000) Genes VII. Oxford University Press, New York

20. Marchisio MA (2014) In silico design and in vivo implementation of yeast gene Boolean gates. J Biol Eng 8(1):6. https://doi.org/10.1186/1754-1611-8-6

21. Bintu L, Buchler NE, Garcia HG, Gerland U, Hwa T, Kondev J, Kuhlman T, Phillips R (2005) Transcriptional regulation by the numbers: applications. Curr Opin Genet Dev 15(2):125-135. https://doi.org/10.1016/j.gde.2005.02.006

22. Regot S, Macia J, Conde N, Furukawa K, Kjellen J, Peeters T, Hohmann S, de Nadal E, Posas F, Sole R (2011) Distributed biological computation with multicellular engineered networks. Nature 469(7329):207-211. https://doi.org/10.1038/nature09679

23. Bordoy AE, Chatterjee A (2015) Cis-antisense transcription gives rise to tunable genetic switch behavior: a mathematical modeling approach. PLoS One 10(7):e0133873. https://doi.org/10.1371/journal.pone.0133873

24. Bordoy AE, O’Connor NJ, Chatterjee A (2019) Construction of two-input logic gates using transcriptional interference. ACS Synth Biol 8(10):2428-2441. https://doi.org/10.1021/acssynbio.9b00321

25. Shea MA, Ackers GK (1985) The OR control system of bacteriophage lambda. A physical-chemical model for gene regulation. J Mol Biol 181(2):211-230. https://doi.org/10.1016/0022-2836(85)90086-5

26. Cox RS 3rd, Surette MG, Elowitz MB (2007) Programming gene expression with combinatorial promoters. Mol Syst Biol 3:145. https://doi.org/10.1038/msb4100187

27. Endy D (2005) Foundations for engineering biology. Nature 438(7067):449-453. https://doi.org/10.1038/nature04342

28. Marchisio MA, Colaiacovo M, Whitehead E, Stelling J (2013) Modular, rule-based modeling for the design of eukaryotic synthetic gene circuits. BMC Syst Biol 7:42. https://doi.org/10.1186/1752-0509-7-42

29. Hoops S, Sahle S, Gauges R, Lee C, Pahle J, Simus N, Singhal M, Xu L, Mendes P, Kummer U (2006) COPASI-a COmplex PAthway SImulator. Bioinformatics 22(24):3067-3074. https://doi.org/10.1093/bioinformatics/btl485

30. Marchisio MA (2014) Parts & pools: a framework for modular design of synthetic gene circuits. Front Bioeng Biotechnol 2:42. https://doi.org/10.3389/fbioe.2014.00042




Computational Methods for the Design of Recombinase 
Logic Circuits with Adaptable Circuit Specifications 


Ana Zuniga, Jerome Bonnet, and Sarah Guiziou 


Abstract 


Synthetic biology aims at engineering new biological systems and functions that can be used to provide new 
technological solutions to worldwide challenges. Detection and processing of multiple signals are crucial for 
many synthetic biology applications. A variety of logic circuits operating in living cells have been imple- 
mented. One particular class of logic circuits uses site-specific recombinases mediating specific DNA 
inversion or excision. Recombinase logic offers many interesting features, including single-layer architec- 
tures, memory, low metabolic footprint, and portability in many species. Here, we present two automated 
design strategies for both Boolean and history-dependent recombinase-based logic circuits. One approach 
is based on the distribution of computation within multicellular consortia, and the other is a single-cell 
design. Both are complementary and adapted for non-expert users via a web design interface, called CALIN 
and RECOMBINATOR, for multicellular and single-cell design strategies, respectively. In this book 
chapter, we are guiding the reader step by step through recombinase logic circuit design, from selecting 
the design strategy fitting to their final system of interest to obtaining the final design using one of our 
design web interfaces. 


Key words Recombinase, Logic, Synthetic biology, Web interface, Automatized design, History- 
dependent, Boolean, Single cell, Multicellular consortia 


Glossary 
— Compact: A design in which the number of parts needed to 
perform a function is reduced to its minimum. 


— Automatic design: Theoretical design performed via software, 
sometimes through a web interface. 


— Portable: Implementable in various organisms. 


— Scalable: The design principles developed at a given scale (e.g., a 
certain number of inputs) are applicable to a larger scale (here 
for an increasing number of inputs). 


— Complete: Capable of implementing all logic functions. 


— Reusable: The parts developed can be used for the construction 
of other circuits. 


Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology, 
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_8, 
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023 




1 Introduction


Synthetic biology consists in the engineering of new biological 
systems with the aim of (i) further understanding biology and 
(ii) providing new technological solutions to worldwide challenges 
such as climate change and healthcare. Examples of such engi- 
neered biological systems include building synthetic metabolic 
pathways in yeast to produce drugs in a more affordable manner 
[1, 2], developing synthetic live bacterial therapeutics [3-6], 
and engineering functional living biomaterials providing new solu- 
tions to healthcare challenges [7-9]. 

In nature, cells adapt to their environment by sensing and 
processing myriad signals and performing actions accordingly. Sim- 
ilarly, synthetic biological systems rely on the detection and inte- 
gration of multiple endogenous or exogenous signals for 
multiplexed biosensing [10], bioproduction of complex chemical 
compounds [11, 12], or production of biopolymers that can 
respond to change in their environment [8]. 

Synthetic biologists have mimicked electronic circuits to implement cellular devices built from biological molecules that can process multiple signals (Fig. 1a). In this context, the main approach treats molecular or physical signals as binary inputs (which can have two different states, as in electronics), and cellular processing devices are treated as logic circuits. While this chapter focuses on digital logic circuits, analog logic circuits have also been implemented in living organisms [13].

To implement logic circuits, numerous molecular mechanisms have been used, such as transcription regulators [14-17], RNA molecules [18, 19], and site-specific recombinases [20-23]. Here, we focus on implementing logic circuits using recombinases, specifically the family of serine integrases [24]. Serine integrases are a tool of choice for large logic circuit implementation. Numerous orthogonal serine integrases have been characterized [25] and have already been used in many organisms such as bacteria, plants, and mice [26]. Serine integrases recognize two DNA sites and recombine the DNA between these two sites depending on their relative orientations, leading to an inversion of the DNA if the sites are in opposite orientation and to an excision if the sites are in the same orientation (Fig. 1b). In recombinase logic circuits, each input induces the expression of a recombinase, while the circuit output is the expression of a reporter or the production of a compound of interest. To implement logic functions using recombinases, promoters, terminators, and output genes are combined in a specific manner with integrase sites to make the expression of the output gene conditional on a particular combination of inputs (Fig. 1c) [20]. Recombinase logic circuits of up to six inputs have been implemented [22], and various design strategies have been used

Design of Recombinase Logic Circuits 157 



Fig. 1 Recombinase-based logic circuits. (a) Biological logic system. Multiple signals (either environmental or 
endogenous) are detected by a cell. Each analog signal is converted into a binary signal. In this example, 
signals B and C are considered 1, being above a defined threshold, and signal A is considered 0. Then, a logic
circuit (here implementing the logic function (A OR B) AND NOT C) processes these signals and produces a
specific output. Biological logic systems are used to engineer biomaterials, biosensors and control protein and 
metabolite production. (b) Recombinase switch. Expression of a serine integrase is controlled by the input 
signal. Integrase recognizes two integrase sites: attB and attP sites. If the sites are in opposite orientation (left 
side), the DNA between the sites (here the promoter) is inverted leading to two new sites: attL and attR. The 
integrase alone cannot mediate recombination between attL and attR sites. If the sites are in the same 
orientation (right side), the DNA between the sites is excised, leading to a single integrase site, either attL or 
attR sites. (c) Example of a recombinase AND gate [20]. The AND logic device is composed of one promoter, 
two asymmetric terminators surrounded by integrase sites in inversion orientation, and a gene. In the absence 
of input, the output gene cannot be expressed as the RNA polymerase is blocked by the two terminators. In the 
presence of input 1 or input 2, integrase 1 (turquoise) or integrase 2 (orange) is expressed, and the terminator 
surrounding their corresponding sites is inverted. The output gene is still not expressed as one asymmetric 
terminator is still blocking transcription. Both inputs need to have been present to have both terminators 
inverted and then expression of the output gene, implementing an AND gate. (d) Example of the two-input 
history-dependent scaffold. Integrase sites are positioned to permit the expression of output genes in the 
corresponding lineage. For each state of the lineage, a different gene is expressed. The gene 0 is 
only expressed when no input is present. If input 2 is present first, gene 1 is expressed. If input 1 is present 
first, no gene is expressed (nor will be expressed) as the promoter is excised. If input 1 follows input 2, gene 
2 is expressed 


[20, 22, 23, 25, 27]. Integrase-mediated recombination is irrevers- 
ible in the absence of cofactor, and recombinase logic devices 
exhibit permanent memory. Consequently, inputs are considered 
ON if they have been present at any time in the circuit history. 
Recombinase logic devices thus implement what is called “asyn- 
chronous” logic because the inputs can be applied asynchronously. 

If integrase sites are not interleaved, the output of the system will be the same regardless of the order of occurrence of the inputs, implementing Boolean asynchronous logic. This type of logic circuit is of interest for biosensing, in which delayed readout can be necessary [28].
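
The order independence and permanent memory of Boolean asynchronous logic can be illustrated with a toy model of the AND gate of Fig. 1c. This is a schematic sketch written for this purpose, not part of the CALIN or RECOMBINATOR software: the class and method names are hypothetical, and each terminator is reduced to a flag telling whether it still blocks transcription.

```python
class RecombinaseAndGate:
    """Promoter, two asymmetric terminators flanked by integrase sites, one gene."""

    def __init__(self):
        # True means that the terminator still blocks the RNA polymerase.
        self.blocking = {1: True, 2: True}

    def apply_input(self, integrase):
        # Inversion is treated as irreversible: a flipped terminator stays flipped,
        # so an input that was present at any time is remembered.
        self.blocking[integrase] = False

    @property
    def output(self):
        # The gene is expressed only when no terminator blocks transcription.
        return not any(self.blocking.values())


def run(order):
    """Apply inputs (1 and/or 2) in the given order and return the gate output."""
    gate = RecombinaseAndGate()
    for integrase in order:
        gate.apply_input(integrase)
    return gate.output


if __name__ == "__main__":
    for order in ([], [1], [2], [1, 2], [2, 1]):
        print(order, "->", int(run(order)))
```

Both input orders give the same output, which is what distinguishes this class of circuits from the history-dependent ones.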

If the integrase sites are interleaved, recombination reactions can influence each other, and the output of the system can differ depending on the order of occurrence of the inputs; in this case, the logic implemented is history-dependent. These types of programs are ubiquitous in biology, being involved in fundamental processes like cell division (checkpoints), differentiation (cell-fate commitment), and development, as well as in microbial survival strategies, where they provide a fitness advantage in the evolutionary competition [29-31]. History-dependent programs can be used as temporal and spatial trackers for decoding and encoding processes such as development [32].
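
A history-dependent program can be pictured as a small state machine. The sketch below encodes only the behavior described in the caption of Fig. 1d for the two-input scaffold; it is an illustration with hypothetical names, and the lookup table is a summary of that caption rather than a model of the DNA rearrangements.

```python
def expressed_gene(history):
    """Return the expressed output for an ordered sequence of inputs (1 and/or 2)."""
    seen = []
    for signal in history:
        # Recombination is irreversible, so a repeated input changes nothing.
        if signal not in seen:
            seen.append(signal)
    outcomes = {
        (): "gene 0",       # no input has been present
        (2,): "gene 1",     # input 2 came first
        (2, 1): "gene 2",   # input 1 followed input 2
        (1,): None,         # input 1 first: the promoter is excised
        (1, 2): None,       # nothing can be expressed afterwards
    }
    return outcomes[tuple(seen)]


if __name__ == "__main__":
    for history in ((), (2,), (2, 1), (1,), (1, 2)):
        print(history, "->", expressed_gene(history))
```

The two orders of the same pair of inputs, (2, 1) and (1, 2), end in different states, which a Boolean gate cannot express.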

We present in this chapter design frameworks for the imple- 
mentation of asynchronous Boolean and history-dependent logics. 

The design of recombinase logic circuits is challenging as it does not follow electronic logic standards. Interestingly, complex logic functions can be implemented within a single layer; for example, an XOR logic gate can be built using a terminator surrounded by two pairs of integrase sites in inversion orientation [20]. While circuits can be designed by hand for a small number of inputs, the task becomes daunting as the number of inputs and possible part combinations increases. Thus, accessible software tools for designing recombinase-based logic circuits are needed. Similar efforts have been made for repressor-based logic circuits (CELLO) [14]. We developed two computational methods for designing recombinase logic circuits. Each of them provides a different approach to systematize recombinase circuit design. The first design strategy, called CALIN (Composable Asynchronous logic Integrase Networks), allows the implementation of logic circuits by distributing the computational labor across a multicellular consortium, using a limited number of standardized logic devices that can be mixed and matched [23, 27]. CALIN enables the implementation of Boolean and history-dependent logic; both are scalable to five inputs.
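
The single-layer XOR gate mentioned above can be checked with a very small model. The sketch assumes an asymmetric terminator that initially blocks transcription, and it assumes that each integrase inverts the terminator exactly once when its input is present; the function name is hypothetical.

```python
def xor_gate(input1: bool, input2: bool) -> bool:
    """Terminator flanked by two pairs of integrase sites in inversion orientation."""
    terminator_blocks = True  # assumed starting orientation: transcription blocked
    for present in (input1, input2):
        if present:
            # Each integrase that is expressed flips the terminator once.
            terminator_blocks = not terminator_blocks
    # The output gene is expressed when the terminator no longer blocks.
    return not terminator_blocks


if __name__ == "__main__":
    for a in (False, True):
        for b in (False, True):
            print(int(a), int(b), "->", int(xor_gate(a, b)))
```

One inversion opens the gate and a second inversion closes it again, so the output is 1 only when exactly one input has been present.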

The second design strategy, called RECOMBINATOR, uses a 
database of devices generated in a combinatorial manner within 
which architectures implementing a particular function can be 
found. The RECOMBINATOR strategy aims at implementing 
logic within a compact and single-layer device operating in single 
cells, using an ad hoc design for each case [33]. The RECOMBI- 
NATOR database is limited to devices implementing Boolean logic; 
however, the same strategy is applicable in a straightforward man- 
ner to history-dependent logic. 

The two different strategies with their automatized computa- 
tional design methods are complementary; they have different 
properties which can be advantageous depending on the context 


Table 1
Type of design required according to the application type

Fields                  Applications                       Time of usage         Physical confinement   # strains               Type of design
Biosensing              In vitro medical diagnostic;       Short use, one shot   Confined               Unlimited               Multicell
                        environmental diagnostic
                        On-site environmental diagnostic   Long term             Free                   Medium                  Single or multicell, depending on input number
Therapy                 Therapeutic bacteria                                                            Low, better = 1         Single cell
                        Environmental bioremediation                                                    Reduce to 1-3 strains   Single cell
Metabolic engineering   Production by fermentation         Medium term           Medium confinement     Medium                  Single or multicell, depending on input number

of implementation. Therefore, choosing between one and the other 
will depend on the particular specifications determined by the user 
(see Table 1 for a few examples). 

The objective of this book chapter is to provide guidelines on 
how to design recombinase-based logic circuits using multicellular 
or single-cell designs, following the CALIN or RECOMBINATOR 
strategies, respectively (Fig. 2). First, we describe how to choose 
which design strategy to use according to the device specification, 
and then we explain how to define and write down the logic 
function to implement. Finally, we show how to use the two web 
interfaces to obtain the final logic design. 

2 Methods 
2.1 Circuit Specification

Depending on the application, the user, and the complexity of the
logic function, one design should be preferred over the other. The
multicellular approach allows for a systematic and modular design by
applying distributed multicellular computation using a reduced number
of already characterized and composable biological components
(Fig. 2). However, this approach can lead to cellular consortia
composed of a high number of different cells, with issues of
stability. Additionally, to maintain a consortium composed of
different strains, a confined environment is required. The single-cell
approach enables a compact design and avoids competition problems
between strains but leads to more ad hoc designs, which require more
expertise and heavier engineering work to obtain devices operating as
expected (Fig. 2). Table 1 lists some logic circuit applications with
their specifications and the favorable type of logic circuit design to
use.

Fig. 2 Comparison of the two logic circuit design strategies. In
recombinase logic circuit design, modularity is usually inversely
related to compactness. The multicellular design strategy leads to
composable, highly modular circuits, but these circuits have low
compactness as they require the assembly of a multicellular
consortium. The single-cell design strategy leads to compact designs
that can be implemented in a single cell but have low modularity:
each device can be used only in a very restricted set of situations,
and the devices are more challenging to engineer as they do not follow
any design rules

For users without much synthetic biology expertise, a multicellular
design is favorable as the optimization process is straightforward and
existing engineered devices can be used [21, 23]. Similarly, for users
requiring the implementation of numerous logic circuits, the
composable multicellular design is more advantageous as the majority
of functions can be implemented by mixing and matching the existing
logic devices. Of note, depending on the logic function of interest,
the multicellular design computational method can lead to a
single-cell system. The two computational methods can therefore be
performed in parallel and the final logic circuits compared to choose
the final design.

While the single-cell design strategy that we present here could be
extended to history-dependent logic, it currently allows the design of
Boolean logic functions only.

2.2 Logic Function

To use the CALIN or RECOMBINATOR interfaces (for multicellular or
single-cell designs, respectively), users first need to determine the
logic program to implement.

2.2.1 Boolean Logic

For Boolean logic, logic programs can be written as a Boolean
equation (e.g., f(A,B) = A.B) encoding the output state. Since the
establishment of logic by Aristotle and of Boolean algebra by George
Boole, various terms and notations have been used to discuss logical
reasoning and write down Boolean equations (Table 2). For example, to
express a gene only if signal A is present and signal B is absent, the
Boolean function f(A,B) = A AND NOT(B), also written as A.!B, has to
be implemented.
While there might be some debate about which notation is “correct,”
the choice of a notation usually reflects the habits and usages of
particular scientific communities (e.g., mathematics, informatics).
Therefore, for a given application, the only important guideline is to
choose one notation, be consistent, and not mix the different
notations together.

Table 2
Conversion from truth function to Boolean function

Classic language | Mathematical language | Truth function | Boolean function
True | True | | 1
False | False | | 0
OR | Disjunction | ∨ | +
AND | Conjunction | ∧ | .
NOT | Negation | ¬ | ! (a second notation is not legible in the extracted text)

In the RECOMBINATOR design interface, the Boolean function must be
written using + for OR, . for AND, and ! for negation.

In the CALIN web interface, the logic function has to be written down
as a truth table specifying the output state (either 0 for OFF or 1
for ON) for each input state.

2.2.2 History-Dependent Logic

For asynchronous history-dependent logic, the notation is not as
rigorously established. Here, we have to consider the relative time of
occurrence of inputs. State machine diagrams have been used for this
purpose. In our system, we have memory and cannot reset the system;
therefore, the number of possible events is lower than in most typical
state machine diagrams. We represent history-dependent programs as a
lineage tree in which each branch, or lineage, corresponds to a
specific order of occurrence of the inputs (for two inputs: A and then
B; B and then A). The number of lineages is equal to N!, where N is
the number of inputs: for instance, two lineages for two-input
programs and six for three-input programs. Each node of the tree
corresponds to an input state, and we represent the output on each
node by a number from 0 to 9 (0 corresponds to no output and 1 to 9 to
different outputs).
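To illustrate the size of a lineage tree, the short sketch below enumerates the lineages and the nodes for a given set of inputs. Representing a node as a tuple of inputs in their order of occurrence is an assumption chosen for illustration, as are the function names.

```python
from itertools import permutations
from math import factorial

def lineages(inputs):
    """All orders of occurrence of the inputs, one tuple per lineage."""
    return list(permutations(inputs))

def states(inputs):
    """All nodes of the lineage tree: every ordered sequence of distinct
    inputs, from the empty history (no input yet) to the complete lineages."""
    return [h for k in range(len(inputs) + 1) for h in permutations(inputs, k)]

for names in ("AB", "ABC"):
    assert len(lineages(names)) == factorial(len(names))   # N! lineages
    print(len(names), "inputs:", len(lineages(names)), "lineages,",
          len(states(names)), "nodes")
# 2 inputs: 2 lineages, 5 nodes
# 3 inputs: 6 lineages, 16 nodes
```

A history-dependent program can then be stored as a mapping from each node to an output between 0 and 9, for example {("B",): 1, ("A", "B"): 2}, with missing nodes read as 0.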

Of note, in recombinase-based devices, inputs are decoupled 
from the logic implementation. Indeed, the identity of an input is 
defined by the conditional expression of an integrase, e.g., by the 
connection of an inducible promoter responding to a signal (input) 
of interest to an integrase. Therefore, by using a single logic device 
and switching the connection between inputs and integrases, vari- 
ous logic functions can be implemented in a very straightforward 
manner and without further optimization. Logic functions that differ
only by a permutation of their inputs can be implemented using the
same logic device and belong to the same P-class (where P stands for
permutation) (Fig. 3). For example, the function A.not(B) is
P-equivalent to the function not(A).B. We used this property
extensively in the CALIN and RECOMBINATOR web interfaces to reduce the
number of logic devices required or generated. While exemplified here
with a Boolean logic function, this property also holds for
history-dependent functions.

Fig. 3 Implementation of two logic functions belonging to the same
permutation class (P-class) using one logic device and permuting the
connections between integrases and inputs

2.3 The CALIN Web Interface for Multicellular Design

The CALIN web interface allows for the systematic design of logic
circuits operating in a single layer as a multicellular system,
requiring neither cell-cell communication nor spatial separation.

2.3.1 Asynchronous Boolean Logic

The algorithm starts by decomposing each logic function as a sum 
of products of NOT or IMPLY functions, called sub-functions. An 
IMPLY function corresponds to f(X) = X and a NOT function to f 
(X) = NOT(X) (Fig. 4a). Each sub-function is implemented in a 
single cell using a combination of IMPLY elements in series and 
NOT elements in parallel. IMPLY elements are composed of a 
terminator surrounded by integrase sites and NOT elements by 
promoters surrounded by integrase sites (Fig. 4b). 
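To make the decomposition into sub-functions concrete, here is a minimal Python sketch that turns a truth table into a sum of products. It computes prime implicants by pairwise merging, which is the core of the Quine-McCluskey procedure, and then selects a cover greedily; the greedy step is a simplification and is not guaranteed to return the smallest cover in every case. The function names and the truth-table string format (first input as most significant bit) are assumptions, and this is not the CALIN script itself.

```python
from itertools import combinations

def merge(a, b):
    """Merge two implicants that differ in exactly one defined bit, else None."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and "-" not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + "-" + a[i + 1:]
    return None

def prime_implicants(minterms):
    current, primes = set(minterms), set()
    while current:
        merged, used = set(), set()
        for a, b in combinations(sorted(current), 2):
            m = merge(a, b)
            if m is not None:
                merged.add(m)
                used.update((a, b))
        primes |= current - used   # implicants that cannot be merged further
        current = merged
    return primes

def covers(implicant, minterm):
    return all(c in ("-", d) for c, d in zip(implicant, minterm))

def sub_functions(table, names):
    """Decompose a truth table (output string) into product terms."""
    n = len(names)
    assert len(table) == 2 ** n
    on = [format(i, f"0{n}b") for i, out in enumerate(table) if out == "1"]
    primes = sorted(prime_implicants(on))
    todo, chosen = set(on), []
    while todo:  # greedy cover: implicant covering most remaining ON states
        best = max(primes, key=lambda p: sum(covers(p, m) for m in todo))
        chosen.append(best)
        todo -= {m for m in todo if covers(best, m)}

    def term(p):
        lits = [x if c == "1" else "!" + x for c, x in zip(p, names) if c != "-"]
        return ".".join(lits) or "1"

    return [term(p) for p in chosen]

print(sub_functions("01000001", "ABC"))   # ['!A.!B.C', 'A.B.C']
```

In this example the two product terms would each be assigned to one strain, giving a two-strain consortium.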

After the user enters the number of inputs and the truth table
corresponding to the logic function of interest, the web interface
generates the biological logic design: the number of strains required
and the genetic circuit layout for each strain, i.e., the connection
of integrase genes with inducible promoters corresponding to each
input, plus the logic device (Fig. 4c). For each logic device, a DNA
sequence corresponding to an optimized design for E. coli is also
available.
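The behavior expected from such a design can be sketched in a few lines of Python. The snippet below splits each product term into NOT and IMPLY elements and evaluates the output of each strain and of the whole consortium for a given set of inputs that have occurred. It is an abstract illustration under stated assumptions (function names are hypothetical, each term contains at least one literal, and excision is treated as complete and irreversible), not a model of the actual DNA devices.

```python
def strain_design(term):
    """Split one product term such as '!A.!B.C' into its logic elements."""
    literals = term.split(".")
    return {
        # NOT element: promoter flanked by sites, excised when the input occurs
        "NOT": [lit[1:] for lit in literals if lit.startswith("!")],
        # IMPLY element: terminator flanked by sites, excised when the input occurs
        "IMPLY": [lit for lit in literals if not lit.startswith("!")],
    }

def strain_output(design, present):
    """ON only if no NOT input has occurred (promoter kept) and every
    IMPLY input has occurred (all terminators excised)."""
    return (not any(x in present for x in design["NOT"])
            and all(x in present for x in design["IMPLY"]))

def consortium_output(terms, present):
    """The population is ON as soon as one strain is ON."""
    return any(strain_output(strain_design(t), present) for t in terms)

terms = ["!A.!B.C", "A.B.C"]
for t in terms:
    print(t, strain_design(t))
print(consortium_output(terms, {"C"}))        # True
print(consortium_output(terms, {"A", "C"}))   # False
```

Because the strains do not communicate, the consortium output is simply the OR of the individual strain outputs.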

The web interface is based on a Python script that converts a Boolean
logic function into a genetic logic design in an automated manner.
Here, we detail the algorithm implemented by this script.

Fig. 4 CALIN automated design strategy. (a) Workflow of logic circuit
design. The input of the CALIN web interface is a truth table
corresponding to the logic function of interest. The function is
decomposed as a sum of products of IMPLY (such as f(X) = X) and NOT
(such as f(X) = NOT(X)) functions, here: f = f1 + f2 with
f1 = NOT(A).NOT(B).C and f2 = A.B.C. Each sub-function is implemented
in a single cell, and the composition of the f1 and f2 cells allows
the implementation of the full logic function in a multicellular logic
system. (b) Implementation of IMPLY and NOT functions using
recombinase-based excision elements. IMPLY functions are implemented
by placing a terminator, surrounded by integrase sites in excision
orientation, between a promoter and the output gene. In the absence of
input, the terminator blocks the expression of the output gene. In the
presence of the input, the integrase is expressed, and the terminator
is excised, leading to the expression of the output gene. The IMPLY
logic element therefore switches from state 0 to state 1. NOT
functions are implemented by surrounding a promoter with integrase
sites in excision orientation. The output gene is expressed in the
absence of the input; in the presence of the input, the integrase
mediates the excision of the promoter, and the output gene is no
longer expressed. The NOT logic element switches from state 1 to
state 0 in the presence of the input. (c) Output of the CALIN web
interface: the logic device and integrase/inducible promoter cassette
for each cell. The design of the logic devices computing the logic
sub-functions is based on the composition of IMPLY and NOT logic
elements. IMPLY logic elements are placed in series, while NOT logic
elements are placed in parallel. The sub-function f1
(NOT(A).NOT(B).C) is composed of two NOT elements in parallel
corresponding to the NOT(A).NOT(B) function (nested integrase sites in
excision orientation surrounding the promoter) and an IMPLY element
placed between the promoter and the gene corresponding to the C
function. The sub-function f2 is composed of three IMPLY elements in
series



1. The first step is to decompose the input Boolean function into
independent sub-functions. To do so, we write the logic function in
its disjunctive normal form, corresponding to a sum of products of
input variables or their negations, using the McCluskey algorithm
[34]:

$$f(x_1, \ldots, x_i, \ldots, x_N) = \sum_{j=1}^{M} \left( \prod_{i=1}^{N} \phi_{i,j}(x_i) \right)$$

N corresponds to the number of inputs and M to the number of terms in
the disjunction. $\phi_{i,j}$ is either the IMPLY or the NOT function,
such that $\phi_{i,j}(x_i)$ is equal to $x_i$ or $\mathrm{not}(x_i)$.

The McCluskey algorithm takes as input the ON and OFF output states,
corresponding, respectively, to the input states with a one or a zero
as output. The algorithm provides an array of strings as output, each
corresponding to a sub-function.

2. Each sub-function is translated into the corresponding logic and
integrase devices using our Python algorithm. In this design, the
number of logic devices is minimized: functions belonging to the same
P-class are implemented with the same logic device, and only the
connection between inputs and integrases is permuted.

The logic device encoding each sub-function is obtained by extracting
the number of IMPLY and NOT functions of the sub-function and by
following the design rules outlined above and described in detail in
[27]. The integrase device is obtained by associating each integrase
with the input that permits the implementation of the desired logic
function.

3. The DNA sequence of the logic devices for E. coli is generated. The
generated DNA sequence results from a hierarchical composition of
optimized logic elements [21]. Various permutations of integrase sites
have been characterized for each logic element corresponding to IMPLY
and NOT functions with different integrases. Well-behaving IMPLY and
NOT functions were selected and composed to obtain the 16 well-behaving
logic devices permitting the implementation of all 4-input logic
functions [21]. The same design strategy can be used to optimize logic
devices for other organisms.

2.3.2 History-Dependent Logic

In this case, the algorithm takes as input a lineage tree, which is
equivalent to a sequential truth table. The output corresponds to the
biological implementation, namely, for each strain, a graphical
representation of the genetic circuit and its associated DNA sequences
(Fig. 5). In the tree, each node corresponds to a specific state of
the system in response to a different scenario: when no input
occurred, when one input occurred, and when multiple inputs occurred
in a particular sequence.
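The decomposition of a lineage tree into single-lineage subprograms, which the steps of this section describe, can be sketched as a greedy procedure. In the snippet below, a program is assumed to be a dictionary mapping each history (a tuple of inputs in their order of occurrence) to an output between 0 and 9; this data structure, the function name decompose, and the tie-breaking rule are illustrative assumptions and do not reproduce the CALIN code.

```python
from itertools import permutations

def decompose(program, inputs):
    """Split a lineage-tree program into single-lineage subprograms.

    program : dict mapping a history (tuple of inputs in order of occurrence)
              to an output 0-9; histories that are absent or map to 0 are OFF
    returns : list of (lineage, {history: output}) pairs, one per strain
    """
    remaining = {h: out for h, out in program.items() if out != 0}
    subprograms = []
    while remaining:
        # Prioritize the ON state reached after the largest number of inputs.
        start = max(remaining, key=lambda h: (len(h), h))
        rest = [x for x in inputs if x not in start]
        best_lineage, best_on = None, {}
        for tail in permutations(rest):
            lineage = start + tail
            prefixes = [lineage[:k] for k in range(len(lineage) + 1)]
            on = {h: remaining[h] for h in prefixes if h in remaining}
            if len(on) > len(best_on):
                best_lineage, best_on = lineage, on
        subprograms.append((best_lineage, best_on))
        for h in best_on:          # subtract the ON states now implemented
            del remaining[h]
    return subprograms

program = {("B",): 1, ("A", "B"): 1, ("B", "A"): 2}
for lineage, on_states in decompose(program, ("A", "B")):
    print(" then ".join(lineage), on_states)
```

In this two-input example, the ON states after "B" and after "B then A" lie on the same lineage and end up in one subprogram, so two strains suffice for three ON states.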


1. The algorithm first decomposes the lineage tree into subtrees, each
consisting of a single lineage containing one or multiple ON states.
This decomposition is done by iteratively subtracting the lineages
containing ON states (Fig. 5a). To obtain the lowest number of
subprograms, among the lineages with ON states, those whose ON states
occur after the highest number of inputs are extracted first (from the
right to the left of the lineage tree).

2. After decomposition, for each selected lineage, two pieces of
information are extracted: the identity of the ON states and the
corresponding lineage. Based on these two pieces of information, the
history-dependent logic device is constructed. The identity of the
integrase sites is determined by the lineage and the position of the
gene of interest (GOI) in the modular scaffold that executes the
history-dependent programs occurring within a single lineage. More
details on how the modular scaffold allows the implementation of a
single-lineage program can be found in [23]. The order of occurrence
of inputs corresponding to each lineage is used to identify which
sensor modules are needed among the different connection possibilities
between control signals and integrases.

3. Each device implementing one lineage is implemented in one strain.
By composing the devices, we obtain the global design for biological
implementation of the desired history-dependent gene expression
program (Fig. 5b). In the web interface, the biological implementation
provided to the user consists of a graphical representation of the
genetic circuit and the device DNA sequence of each strain (Fig. 5b).
This automated design supports the implementation of all
history-dependent programs with up to five inputs.

Fig. 5 Automated design of history-dependent programs using the CALIN
web interface. (a) Design algorithm. The Python program takes as input
a history-dependent program written as a lineage tree. This program is
decomposed into subprograms; the decomposition is performed by
extracting in priority subprograms with an ON state at the extremity
of the tree (corresponding to the state with a high number of inputs
present). For each subprogram, the algorithm identifies the identity
of the ON states and the order of the inputs in the lineage. Based on
these two pieces of information, the biological design is obtained,
including the graphical design of the integrase cassette and the
history-dependent device, and the corresponding DNA sequence of the
device. The full program design is obtained by composing the designs
of each subprogram in different strains. (b) The CALIN web interface.
Following the previously described algorithm, it takes as input the
logic program as a lineage tree and gives as output the graphical
design and DNA sequence of the device for each subprogram

The maximum number of strains needed to implement an
N-input/M-output history-dependent gene expression program is equal to
N!, which corresponds to the number of possible lineages in an N-input
lineage tree. However, most functions are implementable with fewer
strains than this maximum, as the number of strains corresponds to the
number of lineages in which gene expression is required. Importantly,
as the system does not use cell-cell communication, if one of the
subprograms is ON, the global output of the system is considered to be
ON.

2.4 RECOMBINATOR Database for Single-Layer Design

RECOMBINATOR is a database composed of ~19 million devices 
allowing single-cell implementation of all two- and three-input 
logic functions and up to 92% of four-input Boolean logic func- 
tions. This database was generated by combination and permuta- 
tion of recombinase sites, promoters, genes, and terminators 
(Fig. 6) [33]. 

A web interface allows the user to search the database:
http://recombinator.lirmm.fr. The user writes down the logic function
of interest either as a well-formed formula using the logic operators
“.”, “+”, and “!” or as a binary number corresponding to the output
state in each input state.

Fig. 6 RECOMBINATOR database and web interface. The RECOMBINATOR
database was generated by combination and permutation of integrase
sites, promoters, terminators, and genes. ~19 million architectures
were obtained, each associated with the logic function it computes.
The web interface allows searching in this database using as input a
logic function and providing as an output a list of architectures with
their specifications that can be sorted according to various
biological constraints

Fig. 7 Example of a list of architectures generated by the
RECOMBINATOR web interface for the logic function A AND NOT B (A.!B).
The screenshot corresponds to the first ten listed architectures
without applying any constraint or sorting criteria. For better
visualization, the table has been truncated on the right, showing only
7 of the 12 criteria
(Screenshot content: “Showing 1-20 of 46 items”; columns Architecture,
Dnf, Nb Genes, Nb Promoters, Nb Terminators, Nb Asymetric Terminators,
Nb Parts, Gene At Ends, Cross Promotion Constraint)

Using the same example as previously, to express a gene only if signal
A is present and signal B is absent:

- The well-formed formula is A.!B.

- The binary number is 0010 but can also be written as 0100, as the
logic device design is agnostic to input identity, as explained
previously.


After the logic function of interest is submitted, a table is
generated listing various designs, all theoretically allowing the
implementation of the input logic function, together with their
characteristics (Fig. 7). These designs are called architectures and
are abstracted versions of the final biological devices. Indeed, in an
architecture, the identity (DNA sequence) of each part is not defined;
only the function encoded by the part is.

Each line corresponds to one architecture represented by symbols (see
Table 3 for the correspondence between parts and symbols). The
characteristics of each architecture are specified in the table
generated by the web interface (Fig. 6). Each column corresponds to a
particular feature, such as the number of genes, promoters,
terminators, asymmetric terminators, or parts, or whether the gene is
positioned at the extreme segment of the device.

Architectures can be sorted according to each of these criteria. 
It is also possible to filter them by applying some constraints: 
maximum and/or minimum constraints for number of 
parts, lengths, and on/off constraints for the Boolean criteria, 
which are cross-promotion (promoters facing each other) and 
gene at the end. 
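The same kind of filtering and sorting can be reproduced offline on an exported list of architectures. The sketch below is purely illustrative: the records, their identifiers, and the field names are made-up placeholders that mirror some columns of the result table, and they are neither entries of the real database nor an export format of the web interface.

```python
# Hypothetical records: placeholder values, not entries of the real database.
architectures = [
    {"id": "arch_1", "nb_parts": 6, "gene_at_ends": False, "cross_promotion_ok": True},
    {"id": "arch_2", "nb_parts": 8, "gene_at_ends": True,  "cross_promotion_ok": True},
    {"id": "arch_3", "nb_parts": 6, "gene_at_ends": True,  "cross_promotion_ok": False},
]

def select(archs, max_parts=None, gene_at_ends=None, cross_promotion_ok=None):
    """Keep architectures satisfying every constraint that is not None,
    then sort them by increasing number of parts."""
    kept = [
        a for a in archs
        if (max_parts is None or a["nb_parts"] <= max_parts)
        and (gene_at_ends is None or a["gene_at_ends"] == gene_at_ends)
        and (cross_promotion_ok is None
             or a["cross_promotion_ok"] == cross_promotion_ok)
    ]
    return sorted(kept, key=lambda a: a["nb_parts"])

chosen = select(architectures, max_parts=6, cross_promotion_ok=True)
print([a["id"] for a in chosen])   # ['arch_1']
```

Sorting by the number of parts reflects the idea of preferring the simplest architecture as a starting point for biological implementation.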

For more details on each architecture, the view button at the extreme
right of each line leads to a new page with all the characteristics of
a specific architecture and the recombination state of the
architecture for each input state (Fig. 8). Additionally, from this
page, logic functions belonging to the same P-class are accessible
with their corresponding architecture. Indeed, to go from the
implementation of one logic function to a logic function belonging to
the same P-class, only the identity of the integrase sites has to be
changed (i.e., in the RECOMBINATOR web interface, the color of the
integrase site pairs).

A lot of information is provided to the user, and for most logic
functions, a large number of architectures are available for their
implementation. Passing from an architecture to a biological
implementation can be challenging and will require optimization;
choosing the simplest architecture and the one most suited to the
final chassis will increase the probability of successful biological
implementation.

Table 3
Symbols used in the RECOMBINATOR web interface to represent each part
in the different possible orientations

Part | Symbol
Promoter in forward orientation | (graphical symbol, not legible in the extracted text)
Promoter in reverse orientation | (graphical symbol, not legible in the extracted text)
Terminator in forward orientation | (graphical symbol, not legible in the extracted text)
Terminator in reverse orientation | (graphical symbol, not legible in the extracted text)
Gene in forward orientation | G
Gene in reverse orientation | (graphical symbol, not legible in the extracted text)
Sites in excision orientation | [ ]
Sites in inversion orientation | ( )

Fig. 8 Detailed description of the properties of one architecture and
its recombination intermediates in the RECOMBINATOR web interface.
Screenshot of the webpage obtained from the view button of the first
architecture in the architecture list for A.!B
(Screenshot content: Boolean function (minimal form) a.!b; Boolean
function (binary form) 0010; length 1200 bases; number of genes 1;
number of promoters 1; number of terminators 0; number of asymetric
terminators 0; maximum distance from promoter to gene 80; number of
parts 6; gene at ends: no; cross promotion constraint: respected)

3 Conclusion

We presented two design strategies to implement asynchronous Boolean
logic programs in living organisms using serine integrases. These two
design strategies are complementary: one is modular and scalable but
requires a multicellular system, while the other is ad hoc and can
therefore be more complex to engineer but operates in a single cell.

The RECOMBINATOR database and the CALIN interface allow the design of
Boolean logic circuits with up to five inputs. For CALIN, the Python
software supports the design of circuits with more than five inputs
and is available on GitHub
(https://github.com/synthetic-biology-group-cbs-montpellier/calin); we
limited the web interface to five inputs to avoid slowing down the
service.

The automatization of the design of history-dependent pro- 
grams is only available with CALIN, which allows the implementa- 
tion of programs with up to five inputs and ten outputs. However, 
the strategy of the RECOMBINATOR database could be applied 
to history-dependent programs by allowing the generation of 
devices with interlinked integrase sites and dependent on the 
order of occurrence of inputs. The first challenge would be the 
size of this database which will significantly increase. 

Of note, we have experimentally validated the CALIN frame- 
work for both Boolean and history-dependent logic [21, 23], while 
the architectures provided by RECOMBINATOR are for now only 
theoretical. The large diversity and peculiarities of some of the 
designs will probably require the user to test several different 
architectures and optimize their behavior on a case-by-case basis. 

We hope that this book chapter will guide synthetic biologists as well
as scientists from other fields in choosing the most appropriate
design strategy for their specific application and will facilitate the
design of their logic devices using our design web interfaces.


References

1. Galanie S, Thodey K, Trenchard IJ et al (2015) Complete biosynthesis of opioids in yeast. Science 349:1095-1100. https://doi.org/10.1126/science.aac9373
2. Paddon CJ, Westfall PJ, Pitera DJ et al (2013) High-level semi-synthetic production of the potent antimalarial artemisinin. Nature 496:528-532. https://doi.org/10.1038/nature12051
3. Isabella VM, Ha BN, Castillo MJ et al (2018) Development of a synthetic live bacterial therapeutic for the human metabolic disease phenylketonuria. Nat Biotechnol 36(9):857-864. https://doi.org/10.1038/nbt.4222
4. Praveschotinunt P, Duraj-Thatte AM, Gelfat I et al (2019) Engineered E. coli Nissle 1917 for the delivery of matrix-tethered therapeutic domains to the gut. Nat Commun 10:5580. https://doi.org/10.1038/s41467-019-13336-6
5. Cui M, Sun T, Li S et al (2021) NIR light-responsive bacteria with live bio-glue coatings for precise colonization in the gut. Cell Rep 36:109690. https://doi.org/10.1016/j.celrep.2021.109690
6. Kalos M, June CH (2013) Adoptive T cell transfer for cancer immunotherapy in the era of synthetic biology. Immunity 39:49-60. https://doi.org/10.1016/j.immuni.2013.07.002
7. Bryksin AV, Brown AC, Baksh MM et al (2014) Learning from nature - novel synthetic biology approaches for biomaterial design. Acta Biomater 10:1761-1769. https://doi.org/10.1016/j.actbio.2014.01.019
8. Kalyoncu E, Ahan RE, Ozcelik CE, Seker UOS (2019) Genetic logic gates enable patterning of amyloid nanofibers. Adv Mater 31(39):e1902888. https://doi.org/10.1002/adma.201902888
9. Tang T-C, Tham E, Liu X et al (2021) Hydrogel-based biocontainment of bacteria for continuous sensing and computation. Nat Chem Biol 17:724-731. https://doi.org/10.1038/s41589-021-00779-6
10. Chang H-J, Voyvodic PL, Zuniga A, Bonnet J (2017) Microbially derived biosensors for diagnosis, monitoring and epidemiology. Microb Biotechnol 10(5):1031-1035. https://doi.org/10.1111/1751-7915.12791
11. Kim SG, Noh MH, Lim HG et al (2018) Molecular parts and genetic circuits for metabolic engineering of microorganisms. FEMS Microbiol Lett 365:fny187. https://doi.org/10.1093/femsle/fny187
12. Pham HL, Wong A, Chua N et al (2017) Engineering a riboswitch-based genetic platform for the self-directed evolution of acid-tolerant phenotypes. Nat Commun 8:411. https://doi.org/10.1038/s41467-017-00511-w
13. Sarpeshkar R (2014) Analog synthetic biology. Philos Trans A Math Phys Eng Sci 372:20130110. https://doi.org/10.1098/rsta.2013.0110
14. Nielsen AAK, Der BS, Shin J et al (2016) Genetic circuit design automation. Science 352(6281):aac7341. https://doi.org/10.1126/science.aac7341
15. Macia J, Manzoni R, Conde N et al (2016) Implementation of complex biological logic circuits using spatially distributed multicellular consortia. PLoS Comput Biol 12:e1004685
16. Gander MW, Vrana JD, Voje WE et al (2017) Digital logic circuits in yeast with CRISPR-dCas9 NOR gates. Nat Commun 8:15459. https://doi.org/10.1038/ncomms15459
17. Anderson DA, Voigt CA (2021) Competitive dCas9 binding as a mechanism for transcriptional control. Mol Syst Biol 17:e10512. https://doi.org/10.15252/msb.202110512
18. Win MN, Smolke CD (2007) A modular and extensible RNA-based gene-regulatory platform for engineering cellular function. Proc Natl Acad Sci U S A 104:14283-14288. https://doi.org/10.1073/pnas.0703961104
19. Green AA, Kim J, Ma D et al (2017) Complex cellular logic computation using ribocomputing devices. Nature 548(7665):117-121. https://doi.org/10.1038/nature23271
20. Bonnet J, Yin P, Ortiz ME et al (2013) Amplifying genetic logic gates. Science 340:599-603. https://doi.org/10.1126/science.1232758
21. Guiziou S, Mayonove P, Bonnet J (2019) Hierarchical composition of reliable recombinase logic devices. Nat Commun 10:456. https://doi.org/10.1038/s41467-019-08391-y
22. Weinberg BH, Pham NTH, Caraballo LD et al (2017) Large-scale design of robust genetic circuits with multiple inputs and outputs for mammalian cells. Nat Biotechnol 35:453-462
23. Zuniga A, Guiziou S, Mayonove P et al (2020) Rational programming of history-dependent logic in cellular populations. Nat Commun 11:4758. https://doi.org/10.1038/s41467-020-18455-z
24. Merrick CA, Zhao J, Rosser SJ (2018) Serine integrases: advancing synthetic biology. ACS Synth Biol 7:299-310. https://doi.org/10.1021/acssynbio.7b00308
25. Yang L, Nielsen AAK, Fernandez-Rodriguez J et al (2014) Permanent genetic memory with >1-byte capacity. Nat Methods 11:1261-1266
26. Fogg PCM, Colloms S, Rosser S et al (2014) New applications for phage integrases. J Mol Biol 426:2703-2716. https://doi.org/10.1016/j.jmb.2014.05.014
27. Guiziou S, Ulliana F, Moreau V et al (2018) An automated design framework for multicellular recombinase logic. ACS Synth Biol 7:1406-1412. https://doi.org/10.1021/acssynbio.8b00016
28. Courbet A, Endy D, Renard E et al (2015) Detection of pathological biomarkers in human clinical samples via amplifying genetic switches and logic gates. Sci Transl Med 7(289):289ra83
29. Byrne KM, Monsefi N, Dawson JC et al (2016) Bistability in the Rac1, PAK, and RhoA signaling network drives actin cytoskeleton dynamics and cell motility switches. Cell Syst 2:38-48. https://doi.org/10.1016/j.cels.2016.01.003
30. Harmon B, Chylek LA, Liu Y et al (2017) Timescale separation of positive and negative signaling creates history-dependent responses to IgE receptor stimulation. Sci Rep 7:15586. https://doi.org/10.1038/s41598-017-15568-2
31. Wolf DM, Fontaine-Bodin L, Bischofs I et al (2008) Memory in microbes: quantifying history-dependent behavior in a bacterium. PLoS One 3:e1700. https://doi.org/10.1371/journal.pone.0001700
32. Guiziou S, Chu JC, Nemhauser JL (2021) Decoding and recoding plant development. Plant Physiol 187:515-526. https://doi.org/10.1093/plphys/kiab336
33. Guiziou S, Pérution-Kihli G, Ulliana F, Leclère M (2019) Exploring the design space of recombinase logic circuits. bioRxiv 2019:711374
34. Enderton H, Enderton HB (2001) A mathematical introduction to logic. Academic Press



Designing a Model-Driven Approach Towards Rational 
Experimental Design in Bioprocess Optimization 


Jing Wui Yeoh and Chueh Loo Poh 


Abstract 


To enable a more rational optimization approach to drive the transition from lab-scale to large industrial 
bioprocesses, a systematic framework coupling both experimental design and integrated modeling was 
established to guide the workflow executed from small flask scale to bioreactor scale. The integrated model 
relies on the coupling of biotic cell factory kinetics to the abiotic bioreactor hydrodynamics to offer a 
rational means for an in-depth understanding of two-way spatiotemporal interactions between cell beha- 
viors and environmental variations. This model could serve as a promising tool to inform experimental work 
with reduced efforts via full-factorial in silico predictions. This chapter thus describes the general workflow 
involved in designing and applying this modeling approach to drive the experimental design towards 
rational bioprocess optimization. 


Key words Bioprocess, Cell kinetic model, Computational fluid dynamics, Integrated modeling, 
Vanillin bioproduction 


1 Introduction


To address global sustainability concerns, microbial cells have been
extensively utilized as cell factories to synthesize various valuable
products [1]. However, the transition from small lab-scale experiments
to large industrial bioprocesses is often challenged by the
non-homogeneous mixing conditions encountered in bioreactors,
which profoundly impact cell growth and bioproduction
performance [2]. Understanding how the cells respond to
environmental variations temporally and spatially could pave the
way towards a more rational optimization of bioprocesses.
Despite all the previous modeling efforts [3-5], there is a lack of 
consensus on the established practices to drive the experiments 
from small-scale to large-scale bioprocesses. To achieve a more 
rational model-driven approach in fine-tuning of bioproduction 
performance, we have established a systematic model-driven 


Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology, 
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_9, 
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023 




framework built upon an integrated model, working in parallel with 
rational experimental designs from flask to bioreactor scales, for the 
bioprocess optimization [6]. 

Using ferulic acid to vanillin biotransformation as a case study
[7, 8], an integrated model coupling cell factory kinetics with
three-dimensional computational fluid dynamics of the bioreactor
has been developed to assess the impacts of impeller rotational
speed (RPM) and air supply rate (LPM) on biomass growth and
bioproduction performance. These two variables are deemed to
account for the overall aeration and mass transfer rate within the
stirred-tank bioreactor [9]. This chapter describes the steps and
general strategies involved in the experimental characterization
studies from flask scale to bioreactor scale, elucidates the cell
kinetic model development at different phases in parallel with the
experimental studies used for validation, and finally outlines the
setup of the computational fluid dynamics (CFD) case and ways to
analyze and visualize the results. This model-driven framework can
easily be generalized to other bioproduction processes, enabling us
to harness both the intuitive and non-intuitive knowledge gained
from experiments and translate it into a quantitative model to be
actively used in rational bioprocess optimization across different
phases [2].


2 Materials and Methods 


This section describes the details of materials and methods used in
the corresponding subsections. In general, Fig. 1 illustrates a
systematic model-driven framework on how in silico modeling
approach works in synergy with experimental studies across different
phases (from small flask scale to larger bioreactor scale) for a
rational bioprocess optimization.


2.1 Plasmid Construction


This section briefly describes the general methods involved in
plasmid construction such as plasmid design, assembly, and
transformation and highlights the different parts of the plasmid used
in this study.


• Perform all plasmid designs and sequencing analyses using
Benchling designer (Benchling, Inc., San Francisco, CA, USA).

• Obtain the backbone plasmid pBbE8k (JBEI Part ID:
JPUB_000036, colE1 ori, KanR) from Addgene (Addgene,
MA, USA).

• Use arabinose-induced pBAD promoter with default ribosome
binding site (rbsD) of strong relative strength to drive the
feruloyl-CoA synthase (Fcs) gene which encodes the enzyme
used to convert substrate ferulic acid to intermediate
feruloyl-CoA.


[Figure 1. Panels: Experiment (flask scale to bioreactor scale); Integrated Model (cell kinetics and bioreactor hydrodynamics); In silico full-factorial predictions]


Fig. 1 A systematic in silico model-driven framework working in synergy with experimental studies from flask 
scale to bioreactor scale to enable a rational optimization of bioprocess design. The top panel shows (left) 
samples of plasmid and experimental results obtained at flask studies (the performances when adding 
different substrate concentrations and the corresponding inhibitory effects) and (right) temporal profiles of 
different variables compared with results from cell kinetics and integrated models at specific RPM and LPM 
conditions and performances at different combinatorial conditions. The bottom panel illustrates the workflow 
of in silico model development starting from (left) developing a cell kinetic model to account for the principal 
cell variables, genetic circuit enzyme expression dynamics, and the bioconversion pathway of ferulic acid to 
vanillin. (Middle right) This is followed by the development of bioreactor geometry model for simulation using 
computational fluid dynamics. Integrating the cell kinetic model into the bioreactor CFD model allows the 
visualization of the temporal profiles and spatial distributions of variables across the entire bioreactor to 
examine the mixing effects. Full-factorial simulations under different combinatorial conditions can be 
performed to identify the optimal operating condition. (Parts of the figure adopted from [6] with permission) 


• Use aTc-induced pTet promoter with BBa_B0034 (rbs34) of
medium relative strength to express the enoyl-CoA hydratase/aldolase
(Ech) gene that encodes the enzyme to transform
feruloyl-CoA into product vanillin.

• Perform Gibson assembly following the standard molecular
biology techniques.

• Transform the plasmid into TOP-10 chemically competent
E. coli (Invitrogen) to be used for the bioconversion of ferulic
acid to vanillin.

• Culture the cells in minimal M9 media with 0.2% (w/v) casamino
acids and 0.2% (v/v) glycerol as a sole carbon source.


2.2 Growth Decoupling Strategy


Expression of heterologous enzymes could impose a significant
metabolic burden on cell vitality by redirecting the limited resource
pool away from growth. It is thus important to decouple the cell
growth phase from the expression phase to maximize the biosynthesis
yield without compromising cell viability and productivity. This is
indispensable when dealing with substrates, products, or
intermediates that are toxic to the cells.


• Grow the cells overnight, then inoculate and incubate them in
freshly prepared medium for 2-3 h.

• Inoculate the cells into a flask to reach a starting OD600 of 0.1.

• Grow the cells under the control condition to identify the growth
profile over time across the different growth phases (lag phase,
exponential log phase, slowdown phase, stationary phase).

• Repeat the process for a new culture with the same starting OD600.

• Grow the cells until they reach the exponential growth
phase, which is usually 2-3 h after the start of the
experiment.

• Add the two inducers to the culture to trigger the expression of
the two heterologous enzymes required for the biotransformation
pathway.

• Continue growing the cells until they reach the slowdown
growth phase.

• Add the substrate (ferulic acid in this case) to the culture to start
the bioconversion process.

• Measure the bioproduction yield and productivity at the end of the
experiment and adjust the time points for induction and substrate
addition accordingly to determine the optimal protocol for
maximal productivity and yield.


2.3 Flask Study Characterization


Before implementing the process in a large-scale bioreactor,
experiments can be conducted at the flask scale to optimize the
strain, medium composition, and bioproduction protocol and duration.
More importantly, these small-scale experiments provide the
preliminary quantitative data required for parameter inference of
the cell mechanistic model, which underpins the more detailed model
required at the larger bioreactor scale.


• Perform flask-scale experiments at varying concentrations to
determine the glycerol supply and casamino acid concentration
required for optimal cell growth.

• Ensure that glycerol is the limiting factor when cells reach
stationary phase and that the other supplements are in
abundance, for ease of regulation and modeling.

• Conduct experiments at different inducer (arabinose and aTc)
concentrations to identify the concentrations that drive the
expression of the two enzymes to achieve the maximal
bioproduction performance in the shortest time.


• Characterize the inhibitory impacts on the growth profile when
the culture is supplemented with different substrate or product
concentrations, considering the potential bacteriostatic effects of
phenolic compounds (ferulic acid) and aromatic aldehydes
(intermediate feruloyl-CoA, vanillin).

• Use 60 ml minimal M9 media in 250 ml flasks and inoculate
with 0.6 ml of E. coli in seed medium (overnight culture in
Luria broth (LB)) supplemented with an appropriate amount of
the antibiotic kanamycin.

• Add the inducers arabinose (0.2%) and aTc (200 nM)
simultaneously to the culture at the start of the exponential growth
phase (at about 3.5 h in our case).

• Administer the substrate ferulic acid dissolved in dimethyl
sulfoxide (DMSO) when the cells begin to enter the slowdown
phase (at approximately 5.5 h in our study).

• Collect samples with a sampling volume of 1.5 ml at 2-h
intervals.

• Measure the cell optical density (OD) at 600 nm using a
spectrometer (Eppendorf BioPhotometer Plus).

• Dilute samples with an OD above 2 to obtain a more
accurate reading.

• Measure the ferulic acid and vanillin concentrations using
HPLC (Shimadzu SPD-M20A Prominence Diode Array
Detector) with a mobile phase of 40% methanol and 60% (1%
acetic acid).

• Measure the glycerol concentration using HPLC (Agilent
Technologies 1260 Infinity Refractive Index Detector) with a mobile
phase of 0.005 M H2SO4.

• Perform the experiments in duplicate or triplicate and compute
the mean and standard deviation.


2.4 Bioreactor Study Characterization


To better capture the interactions between biomass growth and
the experimental factors, bioreactor experiments are carried out in
the control (non-induced) and induced states to characterize
biomass growth and biotransformation performance under varying
RPM and LPM combinatorial operating conditions, which account for
the overall aeration and mass transfer rate within the bioreactor.


• Use a stirred-tank bioreactor with a single wall disk bottom
vessel (Winpact Evo Fermentation System FS-07 series, Solid
State Fermentation System FS-V-SAO05P) to study the batch
culture system for bioproduction.

• Use a 1.5 L fermenter with a working volume of 1 L with a
geometry of 10 cm inside diameter and 20 cm height.




• Install two four-blade Rushton-type impellers with a six-hole air
sparger above the bottom of the tank.

• Supply air using an air pump fitted with a 0.22 µm filter.

• Steam-sterilize the bioreactor filled with 1 L of M9 minimal media
before running experiments.

• Set the temperature to 37 °C, set the aeration and agitation
accordingly, and keep the bioreactor running overnight to allow the
medium to reach dissolved oxygen (DO) saturation.

• Inoculate the reactor with 20 ml of E. coli in seed medium (from
an overnight culture in LB) together with an appropriate amount of
the antibiotic kanamycin.

• Following the growth decoupling technique, and as in the
flask studies, add the two inducers (0.2% arabinose and 200 nM
aTc) upon reaching the exponential phase at about 3.5 h to
trigger the expression of the enzymes Fcs and Ech.

• Administer the substrate (0.1% ferulic acid in DMSO) to
initiate the biotransformation process and leave it to run for 5 h;
the experiment ends at 10.5 h.

• Collect a 5 ml liquid sample at 2-h intervals for measurement.

• Measure the cell OD, glycerol, ferulic acid, and vanillin using
the same techniques described for the flask-scale studies.

• Conduct the bioreactor experiments at ten different dual-factor
(RPM and LPM) combinatorial conditions as shown in Fig. 2
(0 RPM-0.5 LPM, 0 RPM-3.5 LPM, 100 RPM-0.5 LPM,
150 RPM-0.5 LPM, 225 RPM-0.5 LPM, 400 RPM-0.5
LPM, 225 RPM-0 LPM, 225 RPM-1 LPM, 225 RPM-3.5
LPM, and 400 RPM-3.5 LPM) for model validation while
the other factors are kept constant.
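
As a cross-check of the list above, the ten conditions can be written out as (RPM, LPM) pairs. The short Python sketch below only reorganizes the values quoted in this section. Reading the design as two one-factor sweeps that cross at 225 RPM-0.5 LPM, plus two extra points at 3.5 LPM, is an interpretation rather than something stated in the protocol, and all variable names are illustrative.

```python
from itertools import product

# (RPM, LPM) pairs exactly as quoted in the protocol step above.
conditions = [
    (0, 0.5), (0, 3.5), (100, 0.5), (150, 0.5), (225, 0.5),
    (400, 0.5), (225, 0.0), (225, 1.0), (225, 3.5), (400, 3.5),
]

# One-factor sweeps (an interpretation of the design, see lead-in).
rpm_sweep = sorted(c for c in conditions if c[1] == 0.5)   # LPM fixed at 0.5
lpm_sweep = sorted(c for c in conditions if c[0] == 225)   # RPM fixed at 225
extra_points = sorted(set(conditions) - set(rpm_sweep) - set(lpm_sweep))

# Full-factorial grid spanned by the tested levels only.
rpm_levels = sorted({rpm for rpm, _ in conditions})
lpm_levels = sorted({lpm for _, lpm in conditions})
full_grid = list(product(rpm_levels, lpm_levels))
untested = [c for c in full_grid if c not in set(conditions)]

print("RPM sweep at 0.5 LPM:", rpm_sweep)
print("LPM sweep at 225 RPM:", lpm_sweep)
print("Extra points:", extra_points)
print(len(conditions), "tested of", len(full_grid), "grid points;",
      len(untested), "left to in silico prediction")
```

On the levels actually tested, the grid has 20 points, of which 10 were run experimentally. A finer grid for the in silico predictions would simply use more levels in `rpm_levels` and `lpm_levels`.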


Fig. 2 A schematic diagram showing the different RPM and LPM combinatorial
operating conditions conducted experimentally, chosen rationally based on the
sensitive regions from dose-response curves under different metrics (peak OD
and productivity)


2.5 Cell Factory 
Kinetic Modeling 


2.6 Flask-Scale 
Model Development 
and Validation 




This section delineates the general workflow and techniques used in
developing the cell factory kinetic model. More detailed model
development and validation at flask scale and bioreactor scale are
elaborated in the subsequent sections.


• The model simulations were implemented in MATLAB
R2018b (MathWorks) for our study.

• Derive the kinetic model formulation in the form of ordinary
differential equations (ODEs) to describe the different cellular
or environmental variables.

• Solve the ODEs using numerical methods such as the forward Euler
approximation or the MATLAB built-in function ode15s. For
dynamic profiles with fluctuations, the forward Euler approximation
seems to provide more stable and accurate results after
tuning the time step, whereas ode15s provides higher computational
speed owing to its adaptive variable-step, variable-order
scheme but might not accurately represent results simulated in a
dynamic manner.

• Apply global and/or local optimizers for parameter estimation
when comparing model simulations against experimental data
points.

• In our case, the function fminsearchbnd, a bound-constrained
local optimization algorithm based on the Nelder-Mead
simplex search method, was used to perform the parameter
estimation given initial guesses and lower and upper bounds.
"None" can be used for those parameters without
known bounds.

• The full cell factory model encompasses the descriptions of
biomass growth, nutrient consumption, dissolved oxygen
dynamics, heterologous gene circuit enzyme expression, and
the enzyme-catalyzed biotransformation pathway.
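
The two numerical ingredients named above, forward Euler integration and a bounded Nelder-Mead parameter search, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MATLAB implementation used in the study. It fits only a logistic growth curve. The "measurements" are synthetic values generated from hypothetical parameters (`TRUE_MU`, `TRUE_K`). The bounds are imposed through a sine change of variables, which is one common way to constrain a simplex search; whether fminsearchbnd works the same way internally is not stated in this chapter.

```python
import math

def simulate_logistic(mu, K, x0=0.1, t_end=10.0, dt=0.05):
    """Forward Euler integration of dX/dt = mu * X * (1 - X / K)."""
    xs = [x0]
    for _ in range(int(round(t_end / dt))):
        x = xs[-1]
        xs.append(x + dt * mu * x * (1.0 - x / K))
    return xs

def to_bounded(z, lo, hi):
    """Map an unconstrained value z into [lo, hi] with a sine transform."""
    return lo + (hi - lo) * (math.sin(z) + 1.0) / 2.0

def to_unbounded(x, lo, hi):
    """Inverse of to_bounded for an initial guess inside the bounds."""
    return math.asin(2.0 * (x - lo) / (hi - lo) - 1.0)

def nelder_mead(f, z0, step=0.5, tol=1e-12, max_iter=400):
    """Compact Nelder-Mead search (reflection, expansion, inside
    contraction, shrink) on an unconstrained objective f."""
    n = len(z0)
    simplex = [list(z0)] + [
        [z + (step if i == j else 0.0) for j, z in enumerate(z0)]
        for i in range(n)
    ]
    for _ in range(max_iter):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        if abs(f(worst) - f(best)) < tol:
            break
        centroid = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]
        reflected = [2.0 * c - w for c, w in zip(centroid, worst)]
        if f(reflected) < f(best):
            expanded = [3.0 * c - 2.0 * w for c, w in zip(centroid, worst)]
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = [0.5 * (c + w) for c, w in zip(centroid, worst)]
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                simplex = [best] + [
                    [0.5 * (b + p) for b, p in zip(best, point)]
                    for point in simplex[1:]
                ]
    return min(simplex, key=f)

# Synthetic "measurements" every 2 h from hypothetical parameter values.
TRUE_MU, TRUE_K = 0.8, 4.0          # 1/h and OD600, illustration only
SAMPLE_EVERY = 40                   # 2 h divided by the 0.05 h time step
data = simulate_logistic(TRUE_MU, TRUE_K)[::SAMPLE_EVERY]

BOUNDS = [(0.1, 2.0), (1.0, 10.0)]  # (lower, upper) for mu and K, assumed

def decode(z):
    return [to_bounded(zi, lo, hi) for zi, (lo, hi) in zip(z, BOUNDS)]

def sse(z):
    """Sum of squared errors between simulation and data."""
    mu, K = decode(z)
    model = simulate_logistic(mu, K)[::SAMPLE_EVERY]
    return sum((m - d) ** 2 for m, d in zip(model, data))

guess = [0.4, 2.0]
z0 = [to_unbounded(g, lo, hi) for g, (lo, hi) in zip(guess, BOUNDS)]
mu_fit, K_fit = decode(nelder_mead(sse, z0))
print(f"mu = {mu_fit:.3f} 1/h, K = {K_fit:.3f}")
```

Because the search runs in the transformed variables, every candidate evaluated by the simplex stays inside the bounds by construction. The Euler time step (`dt`) is the quantity to tune when the simulated profiles look unstable.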


A primary step towards developing the integrated modeling framework
is the development of a preliminary cell mechanistic
model which can capture the phenomena observed in small flask-scale
experimental studies. This enables one to quickly arrive at
a coarse-grained yet informative model that quantitatively captures
the essential components and could serve to optimize the
experimental designs at the early phase. This section outlines the
steps involved in the early model development and validation at
the flask scale.


• Develop a simple biophysical kinetic model as illustrated in
Fig. 3 to capture the cell growth profile, genetic circuit enzyme
expression (enzymes Fcs and Ech in our case), and the
biotransformation pathway (from ferulic acid to vanillin formation).


[Figure 3. Panels: Biomass Growth (Logistic Growth Model, with glycerol as carbon source); Genetic Circuit Model; Pathway Level Model (ferulic acid to feruloyl-CoA to vanillin)]


Fig. 3 A preliminary cell mechanistic model consisting of simple growth model, genetic circuit model, and 
pathway-level model developed using flask-scale experimental data, which forms the basis for more detailed 
model at bioreactor scale 


• A simple Verhulst growth model, which is a logistic growth
formalism, was adopted in this study to describe the sigmoidal
growth profile of a batch culture comprising three or four phases
(a short initial lag phase, exponential log phase, and slowdown
phase followed by stationary phase), irrespective of the actual
constraining factors that define the carrying capacity of the cells.

• Describe the inducible enzyme expression of the genetic circuit
by a system of nonlinear ODEs, where the rates of change of
mRNAs and proteins are defined by applying the law of mass
balance and the Hill equation is used to describe the
transcriptional control by inducers.

• Apply the Michaelis-Menten equation to model the catalytic
biotransformation of ferulic acid into the intermediate
feruloyl-CoA, finally leading to vanillin formation, involving the
enzymes Fcs and Ech.

• It is important to include the ratio of the molar masses of the
substrate, intermediate, and product in the equations to
account for the difference in the measured concentration unit
(molar).

• Fit the model to the measured experimental data from the flask
scale to infer the various kinetic parameters, which can be used to
determine the optimal point of induction and the duration of
the biotransformation run.




• Determine the dose-response parameters using other flask-scale
experiments such as examining growth profiles under varying
glycerol concentrations and the inhibitory impacts of substrate
or product on cell growth. The dose-response formulations and
parameters derived from these data lay the groundwork for the
model at bioreactor scale.

• This model will form the basis for scaling up the cell model to
incorporate other environmental factors controlled in bioreactor
setting.
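
The structure of the flask-scale model described in this list (logistic growth, Hill-controlled expression of Fcs and Ech, and a two-step Michaelis-Menten pathway with molar-mass ratios) can be sketched as follows. This is an illustrative Python sketch only. Every kinetic value is a hypothetical placeholder rather than a fitted parameter. The molar masses are approximate, and the equation structure is an assumption where the text gives no detail: enzyme levels are not scaled by biomass, and growth does not depend on expression. The induction time, substrate addition time, and inducer concentrations follow the values quoted for the flask studies.

```python
# Approximate molar masses (g/mol): ferulic acid, feruloyl-CoA, vanillin.
M_FA, M_FCOA, M_VAN = 194.18, 943.7, 152.15

def hill(inducer, k_half, n):
    """Hill activation term for transcriptional control by an inducer."""
    return inducer ** n / (k_half ** n + inducer ** n)

def simulate(t_end=10.5, dt=0.001, t_induce=3.5, t_substrate=5.5):
    # All kinetic values below are hypothetical placeholders.
    mu, K = 0.8, 4.0                 # growth rate (1/h), carrying capacity (OD600)
    a_m, d_m = 5.0, 6.0              # mRNA synthesis (a.u./h) and decay (1/h)
    a_p, d_p = 2.0, 0.5              # translation (1/h) and protein decay (1/h)
    kcat1, km1 = 0.5, 0.3            # Fcs step: g/L/h per a.u. enzyme, g/L
    kcat2, km2 = 1.0, 0.5            # Ech step: g/L/h per a.u. enzyme, g/L
    act_fcs = hill(0.2, 0.05, 2)     # 0.2% arabinose acting on pBAD
    act_ech = hill(200.0, 50.0, 2)   # 200 nM aTc acting on pTet

    s = dict(X=0.1, m1=0.0, p1=0.0, m2=0.0, p2=0.0, FA=0.0, FCoA=0.0, Van=0.0)
    substrate_added = False
    for i in range(int(round(t_end / dt))):
        t = i * dt
        if not substrate_added and t >= t_substrate:
            s["FA"] += 1.0           # 1 g/L ferulic acid, added once
            substrate_added = True
        on = 1.0 if t >= t_induce else 0.0
        v1 = kcat1 * s["p1"] * s["FA"] / (km1 + s["FA"])        # FA consumed, g/L/h
        v2 = kcat2 * s["p2"] * s["FCoA"] / (km2 + s["FCoA"])    # FCoA consumed, g/L/h
        d = dict(
            X=mu * s["X"] * (1.0 - s["X"] / K),
            m1=on * a_m * act_fcs - d_m * s["m1"],
            p1=a_p * s["m1"] - d_p * s["p1"],
            m2=on * a_m * act_ech - d_m * s["m2"],
            p2=a_p * s["m2"] - d_p * s["p2"],
            FA=-v1,
            FCoA=v1 * M_FCOA / M_FA - v2,     # mass gained per mass of FA converted
            Van=v2 * M_VAN / M_FCOA,          # mass gained per mass of FCoA converted
        )
        for key in s:                          # forward Euler update
            s[key] += dt * d[key]
    return s

final = simulate()
moles_in = 1.0 / M_FA
moles_out = final["FA"] / M_FA + final["FCoA"] / M_FCOA + final["Van"] / M_VAN
print(f"OD600 = {final['X']:.2f}, vanillin = {final['Van']:.3f} g/L")
print(f"mole balance error = {abs(moles_out - moles_in):.2e}")
```

The mole balance printed at the end is a quick way to confirm that the molar-mass ratios have been applied consistently: in mass units the three species do not sum to a constant, but in moles they should.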


2.7 Bioreactor-Scale Cell Model Development and Validation


Moving towards the bioreactor scale, it is essential to account for
the relevant external environmental variations and the impacts
imposed on the cell behaviors and bioproduction performance.
This section highlights the strategies involved in developing the
detailed two-way interaction model shown in Fig. 4, supported
by rational experimental designs for model validation.


• To better account for the different external environmental
fluctuations observed in bioreactor, we move on to develop a more
detailed model considering those impacts starting from the
nutrient consumption and dissolved oxygen level at bioreactor
scale.


[Figure 4. Panels: Biomass Growth (Monod Model, aerobic and anaerobic respiration, with glycerol, dissolved oxygen, RPM, and LPM as inputs); Genetic Circuit Model; Pathway Level Model (ferulic acid to feruloyl-CoA to vanillin)]


Fig. 4 More detailed cell model capturing the interactions with bioreactor environmental factors such as
nutrient consumption, dissolved oxygen level under influences of varying RPM and LPM, and inhibitory effects
imposed by toxic substrate and product


• Apply Monod-based formalisms to describe these two key
environmental factors: glycerol as the carbon source and dissolved
oxygen dynamics.

• Validate the developed model with the measured experimental
data collected from the batch culture of the 1 L bioreactor study
conducted in parallel under a specific operational setting, which
is assumed to be the control condition without induction.

• To capture the two-way interactions, which account for the
impact of these fluctuating environmental conditions on
biomass growth, the earlier logistic-based biomass model
was modified to accommodate nutrient- and oxygen-dependent
aerobic respiration.

• In view of the facultative anaerobic nature of E. coli, anaerobic
respiration can also be incorporated into the biomass equation
to account for conditions of low oxygen supply.

• It is also important to consider the inhibitory effects on biomass
growth imposed by both the substrate (ferulic acid) and the product
(vanillin) by incorporating the dose-response formulations and
parameters derived from the flask-scale experiments.

• To mimic the non-homogeneous conditions due to mixing, in
this study, we focus on varying two critical bioreactor parameters:
the impeller stirring speed (RPM) and the air flow rate (LPM),
which are deemed to account for the overall aeration and mass
transfer rate that have profound impacts on cell growth and
bioproduction performance.

• To examine the combined impacts of these two parameters,
different combinatorial experiments can be performed under
variations of the two determinants.

• To study the effect of LPM on biomass growth and
biotransformation performance, fix the RPM in the middle of the
range, such as 225, and then carry out bioreactor experiments with
induction at four different LPM values (0, 0.5, 1, and 3.5).

• The metrics of biomass growth (peak OD) and biotransformation
performance (productivity, yield) can then be computed
from the experimental measurements.

• Calculate the percent yield of the product vanillin as the ratio of
the actual measured yield to the theoretical yield, expressed as a
percentage.

• Compute the productivity as the gradient of product formation
over the biotransformation duration.

• From the dose-response curve, identify the most sensitive LPM,
i.e., the one that renders the most prominent change in the biomass
and biotransformation metrics (0.5 LPM in our case).


2.8 CFD Simulation 
Setup and Integrated 
Modeling 




• To examine the effect of RPM, fix the LPM at the most sensitive
value (0.5 LPM) and then perform bioreactor experiments with
induction over different RPMs (0, 100, 150, 225, 400).

• Fit the dose-response behaviors using the Hill equation to
analyze the impacts and sensitivities of the two determinants
on the biomass growth and biotransformation performance.

• Plot the productivity metric against the peak OD to identify
their correlation; a nonlinear relation was observed in our study.

• To better elucidate the combined influences of RPM and LPM
on biomass growth and biotransformation performance, we
incorporate their compounded impact on the oxygen
mass transfer coefficient and formulate a semi-empirical
equation to account for the growth-associated bioproduction
performance featuring the nonlinear relation.

• Parameterize the developed model based on the time-response
profiles of different cellular or environmental variables obtained
from the bioreactor runs subjected to different RPM and LPM
combinatorial operating conditions.

• Deploy the model with the inferred parameters to predict the
cell behaviors and performance for the full-factorial operating
design spaces.

• This model-driven approach can be used to identify the most
appropriate or optimal operating condition tailored to the
desired goals (e.g., high productivity, high biomass growth,
minimal operating cost).
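
The percent yield and productivity metrics defined in the steps above can be computed from a measured vanillin time course as sketched below in Python. The time points and concentrations are hypothetical numbers chosen for illustration. The molar masses are approximate. The "gradient of product formation" is interpreted here as a least-squares slope, which is one reasonable reading and not necessarily the calculation used in the study. The theoretical yield assumes the 1:1 molar conversion of ferulic acid to vanillin implied by the pathway.

```python
M_FERULIC_ACID = 194.18  # g/mol, approximate
M_VANILLIN = 152.15      # g/mol, approximate

def percent_yield(vanillin_g_per_l, ferulic_acid_g_per_l):
    """Measured vanillin as a percentage of the theoretical maximum,
    assuming 1 mol vanillin per mol ferulic acid supplied."""
    theoretical = ferulic_acid_g_per_l * M_VANILLIN / M_FERULIC_ACID
    return 100.0 * vanillin_g_per_l / theoretical

def productivity(times_h, vanillin_g_per_l):
    """Least-squares slope of vanillin concentration versus time (g/L/h)."""
    n = len(times_h)
    t_mean = sum(times_h) / n
    v_mean = sum(vanillin_g_per_l) / n
    sxy = sum((t - t_mean) * (v - v_mean)
              for t, v in zip(times_h, vanillin_g_per_l))
    sxx = sum((t - t_mean) ** 2 for t in times_h)
    return sxy / sxx

# Hypothetical measurements after substrate addition at 5.5 h.
times_h = [5.5, 7.5, 9.5]
vanillin = [0.0, 0.2, 0.4]     # g/L
ferulic_acid_added = 1.0       # g/L

print(f"productivity = {productivity(times_h, vanillin):.3f} g/L/h")
print(f"yield = {percent_yield(vanillin[-1], ferulic_acid_added):.1f} %")
```

Because vanillin is lighter than ferulic acid, 1 g/L of substrate can give at most about 0.78 g/L of product. Dividing by the substrate mass without the molar-mass ratio would therefore understate the yield.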


Non-ideal or suboptimal mixing conditions could lead to
non-uniform spatial distributions of key variables such as nutrients,
dissolved oxygen, and the administered substrate required for
proper bioconversion and optimal cell growth. To demonstrate
the spatial variations at different operating conditions, we integrate
the developed cell mechanistic model with a fluid dynamic model to
provide a detailed description of the local flow-field dynamics across
the entire space of the bioreactor, as shown in Fig. 5. This section
outlines the steps involved in preparing the bioreactor geometry,
integrating the cell model as transport functions, and setting up the
case for simulation.


• Fluid dynamic simulation was executed using ANSYS® Academic
CFX Release 19.1.

• Draw the geometry of bioreactor using SolidWorks, a
computer-aided design (CAD) and analysis tool, or ANSYS
DesignModeler or SpaceClaim.


[Figure 5. Top panel: Cell kinetic model (pathway level model, ferulic acid to vanillin); bottom panel: Integrated model]


Fig. 5 Integrated model development after coupling cell kinetic model to the bioreactor fluid dynamic model, 
which enables one to visualize the spatial distribution profiles of different cellular variables across the whole 
bioreactor. Top panel: Detailed cell mechanistic model. Bottom panel: Bioreactor geometry, generated mesh 
for simulation, and fluid flow vector field. (Parts of the figure adopted from [6] with permission) 


• Create the three parts/domains of the bioreactor for fluid dynamic
simulation: a main bioreactor domain, a rotating domain, and an
injection domain.

• Assemble the parts by imposing the necessary constraints,
with the rotating domain defined by the moving reference
frame methodology.

• Generate the tetrahedral mesh for the geometry required for
simulation, in which the element count is approximately five
times larger than the number of nodes.

• Proceed to configure the simulation settings such as Analysis
Type and Solver.

• Choose Transient Analysis to run the time-response study and
set the Total Time Duration, the respective Time Steps, and the
Initial Time point.

• Set the proper Solution Units: [kg] for Mass Units, [m] for
Length Units, and [s] for Time Units in our case, as 1 g/L is
equivalent to 1 kg/m³.

• Choose the second-order backward Euler option under Solver
Control.




• Under the Results tab of Output Control, choose the Mass
Fraction for each of the variables and other components such as
Shear Strain Rate or Total Pressure.

• Under the Trn Results tab of Output Control, add Transient
Results and set the Time Interval (0.1 s in our case).

• Insert the different cell variables (cell biomass, oxygen, carbon
dioxide, mRNA and protein for the two enzymes Fcs and Ech,
glycerol, M9 medium, ferulic acid, feruloyl-CoA, and vanillin)
as new materials as Pure Substance and set their Thermodynamic
State under Basic Settings and their Molar Mass and Density
under Material Properties.

• Include an additional material as Variable Composition Mixture,
select all the previously defined materials under Materials List,
and use Liquid as the Thermodynamic State.

• Insert all the cell model formulations into the Expressions after
converting them to mass fractions by dividing by the total density.

• Include additional expressions to calculate the average mass
fraction of each component over the different domains (the
injection domain may be excluded, as it occupies only a small
volume that is negligible compared with the other domains).

• Insert an expression to represent the cell inoculation at the top of
the bioreactor (the inlet boundary of the injection domain) using a
step function with a mass flow rate of 0.1 g/s for a 2 s
injection duration.

• Under the Flow Analysis, create a new boundary condition as
Inlet, set the Boundary Type to Inlet, and choose the Location
to be the top surface of the injection domain.

• Under the Boundary Details, choose the Mass Flow Rate for
Mass and Momentum, insert the defined variable name of the
cell injection expression (the step function mentioned earlier) as
the Mass Flow Rate, and choose Normal to Boundary Condition
as the Flow Direction.

• For the Component Details, select the cell biomass variable and
set the Mass Fraction to 0.1 for the cell inoculation/injection
process, whereas the other variables are set to zero.

• Under the Flow Analysis for the other domains, choose their
respective locations, set them to Fluid Domain, add a new Fluid
Definition and define mix as the Material, and set Continuous Fluid
under Morphology and a Reference Pressure of 1 atm with the
Non-Buoyant Model.

• For the rotating domain, set the rotational speed by choosing
Rotating under Domain Motion and entering the respective
revolutions per minute (100 RPM, 150 RPM, 225 RPM, 400 RPM for
the different biotransformation run cases) under Angular Velocity,
and set the Rotation Axis (Global Y in our case).




2.9 CFD Results Visualization


• For the Fluid Models tab under Flow Analysis, set None (Laminar)
under Turbulence for angular velocities lower than
200 RPM and choose k-Epsilon for the other velocities (based
upon the Reynolds number (Re) computed for the stirred-tank
bioreactor from the fluid density, impeller diameter,
dynamic viscosity of the fluid, and rotational speed; the
system is deemed to be fully turbulent for Re > 10,000).

• Choose Transport Equation for all the components under
Component Models in the Fluid Models tab.

• In the Initialization tab, set the initial values for glycerol and
oxygen under Automatic with Value Option for simulations
in which cells are inoculated from the top of the bioreactor.

• Set the initial value for cell biomass as well when running
simulations of the full bioconversion process.

• After setting up the case under Setup, proceed to the Solution
section to define the Run Mode as Serial or Parallel
and choose Current Solution Data (if possible) for the
Initialization Option.

• Save the case file and submit the definition (.def) file to be run
on a server or workstation, as the simulation is computationally
intensive and time-consuming. It is recommended to
test-run the case for a short period of time to ensure that
the case is set up properly without any unintended errors.
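
The Reynolds-number criterion used above to choose between the laminar and k-Epsilon options can be checked with the standard impeller Reynolds number, Re = ρND²/μ, with N in revolutions per second. The Python sketch below is illustrative only. The impeller diameter is a hypothetical value, since it is not given in this chapter. The fluid is assumed to have the properties of water near 37 °C. The 10,000 cutoff and the option names are taken from the step above.

```python
def impeller_reynolds(rpm, impeller_diameter_m, density_kg_m3, viscosity_pa_s):
    """Impeller Reynolds number Re = rho * N * D^2 / mu, with N in rev/s."""
    n_rev_per_s = rpm / 60.0
    return density_kg_m3 * n_rev_per_s * impeller_diameter_m ** 2 / viscosity_pa_s

def turbulence_option(reynolds, threshold=10_000.0):
    """Return the turbulence setting implied by the Re > 10,000 criterion."""
    return "k-Epsilon" if reynolds > threshold else "None (Laminar)"

# Assumed values, not taken from the chapter:
DENSITY = 993.0        # kg/m^3, water near 37 C
VISCOSITY = 6.9e-4     # Pa s, water near 37 C
IMPELLER_D = 0.05      # m, hypothetical (half of the 10 cm tank diameter)

for rpm in (100, 150, 225, 400):
    re = impeller_reynolds(rpm, IMPELLER_D, DENSITY, VISCOSITY)
    print(f"{rpm:3d} RPM  Re = {re:7.0f}  ->  {turbulence_option(re)}")
```

Because Re scales with the square of the impeller diameter, the cutoff speed is sensitive to that value. The calculation should be repeated with the measured diameter and the actual medium properties before fixing the turbulence setting.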


This section highlights the procedures involved in analyzing and 
visualizing the results (time response, spatial distribution, and ani- 
mation video) obtained from the CFD simulations. 


Once the simulations have completed, a result (.res) file and a 
folder containing the details will be generated. 


The .res file can be loaded into CFD-Post for visualizing the 
simulation results. 


Create a plane under the Location dropdown menu and choose 
the XY Plane which represents the middle plane of the bioreac- 
tor in vertical view. 


Create a contour plot or vector plot to view the concentration 
spatial distribution profiles or velocity vector field to identify the 
different flow patterns. 


Choose the created plane as Locations and choose the 
corresponding mass fraction as Variable and set the Range to Local. 


Set the other features like number of Contours for resolution, 
or settings under Labels and Render based on preferences. 


Use Timestep Selector to view the spatial distribution profile
across the plane at a specific time point.


Integrated Model-Driven Approach for Bioprocess Design 187 


To generate a video of the changes in spatial distribution over
time, use Animation and choose Timestep Animation, set Current
Timestep to the starting time point, click on Save Movie
and select the path to save the video, and Play the Animation to
initiate the process.


To view the time series data, click on the chart, choose the plot
Type (XY - Transient or Sequence) under the Data Series tab,
select Expression, and choose the corresponding variable that
represents the computed average value for the different species
from the dropdown menu.


Export the data as a CSV file for external plotting.


Acknowledgments 


We gratefully acknowledge support from the Singapore National
Research Foundation (NRF 2015 NRF-POC002-030), the Synthetic
Biology Initiative of the National University of Singapore
(DPRT/943/09/14), the Summit Research Programmes of the
National University Health System (NUHSRO/2016/053/SRP/05),
and an NUS Startup Grant (R-397-000-257-133).


References 


1. Xia J, Wang G, Lin J et al (2016) Advances and practices of bioprocess scale-up. Adv Biochem Eng Biotechnol 152:137-151

2. Wang G, Tang W, Xia J et al (2015) Integration of microbial kinetics and fluid dynamics toward model-driven scale-up of industrial bioprocesses. Eng Life Sci 15:20-29

3. Zhang H, Zhang K, Fan S (2009) CFD simulation coupled with population balance equations for aerated stirred bioreactors. Eng Life Sci 9:421-430

4. Lapin A, Klann M, Reuss M (2010) Multi-scale spatio-temporal modeling: lifelines of microorganisms in bioreactors and tracking molecules in cells. Adv Biochem Eng Biotechnol 121:23-43

5. Elqotbi M, Vlaev SD, Montastruc L et al (2013) CFD modelling of two-phase stirred bioreaction systems by segregated solution of the Euler-Euler model. Comput Chem Eng 48:113-120

6. Yeoh JW, Jayaraman SS, Tan SG et al (2021) A model-driven approach towards rational microbial bioprocess optimization. Biotechnol Bioeng 118:305-318

7. Lee EG, Yoon SH, Das A et al (2009) Directing vanillin production from ferulic acid by increased acetyl-CoA consumption in recombinant Escherichia coli. Biotechnol Bioeng 102:200-208

8. Yan L, Chen P, Zhang S et al (2016) Biotransformation of ferulic acid to vanillin in the packed bed-stirred fermentors. Sci Rep 6:34644

9. Sarkar J, Shekhawat LK, Loomba V et al (2016) CFD of mixing of multi-phase flow in a bioreactor using population balance model. Biotechnol Prog 32:613-628




Modeling Subcellular Protein Recruitment Dynamics 
for Synthetic Biology 


Kwabena A. Badu-Nkansah, Diana Sernas, Dean E. Natwick, 
and Sean R. Collins 


Abstract 


Compartmentalized protein recruitment is a fundamental feature of signal transduction. Accordingly, the 
cell cortex is a primary site of signaling supported by the recruitment of protein regulators to the plasma 
membrane. Recent emergence of optogenetic strategies designed to control localized protein recruitment 
has offered valuable toolsets for investigating spatiotemporal dynamics of associated signaling mechanisms. 
However, determining proper recruitment parameters is important for optimizing synthetic control. In this 
chapter, we describe a stepwise process for building linear differential equation models that characterize the 
kinetics and spatial distribution of optogenetic protein recruitment to the plasma membrane. Specifically,
we outline how to construct (1) ordinary differential equations that capture the kinetics, efficiency, and 
magnitude of recruitment and (2) partial differential equations that model spatial recruitment dynamics and 
diffusion. Additionally, we explore how these models can be used to evaluate the overall system perfor- 
mance and determine how component parameters can be tuned to optimize synthetic recruitment. 


Key words Mathematical modeling, Signal transduction, Protein dynamics, Diffusion, Optogenetics, 
Plasma membrane, Localization, Compartmentalization, Protein recruitment, iLID 


1 Introduction


The cellular cortex is a primary site of signaling where dynamic 
protein and lipid scaffolds guide signaling networks to control 
essential cellular behaviors [1]. Signal processing at the plasma 
membrane occurs through multiple classes of mechanisms 
including local modification of cortical proteins, creation of lipid 
subdomains that directly recruit protein effectors, and activation of 
scaffold proteins that promote signal complex formation. In many 
cases, these processes can be hijacked by controlling the localization 
of specific pathway components. As a result, a number of engi- 
neered strategies that mimic primary modes of protein recruitment 


Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology, 
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_10, 
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023 




190 Kwabena A. Badu-Nkansah et al. 


[Figure 1 image. Panel labels: (A) Receptor Recruitment, CRY2 Clustering; (B) Protein-Membrane Association, iLID Recruitment, Membrane Recruitment; (C) Signal Complex Formation]


Fig. 1 An assortment of signaling mechanisms for plasma membrane recruitment coupled with example 
design principles of associated optogenetic systems. (a) Left, cortical protein recruitment by receptor 
activation and clustering. Right, synthetic activation driven by CRY2 optogenetic receptor clustering. (b) 
Left, direct associations between plasma membrane lipid domains and lipid binding proteins. Right, local 
protein recruitment to plasma membrane domains after synthetic enrichment of signaling lipids using iLID 
optogenetic recruitment of a lipid-modifying enzyme. (c) Left, signaling complex formation downstream of an 
activated receptor. Right, synthetic production of signaling complexes through direct stimulation of optoge- 
netic opsin receptors 


to the plasma membrane have emerged as complementary toolsets 
in synthetic biology for investigating compartmentalized dynamics 
of signal transduction (Fig. 1). 

Engineered control of protein localization typically uses chemical
[2-5] and/or light-inducible [6-8] strategies. In general, these
tools rely on tagging a target signal regulator and its binding 
partner(s) separately with genetically encoded affinity domains 
whose association requires exogenous activation. These approaches 
can also be adapted to locally recruit constitutively active regulators 


Modelling Protein Recruitment Dynamics 191 


to cellular compartments of interest that house important effectors 
[6]. Accordingly, exogenous recruitment of target proteins to the 
cell cortex can be achieved by anchoring one dimerization compo- 
nent to the inner leaflet of the plasma membrane. This strategy has 
been employed to selectively activate and recruit Rho GTPases 
[9, 10], control spindle positioning [11], investigate lipid regula- 
tion of ion channels [12], and decipher actin-mediated phosphoi- 
nositide 3-kinase feedback during cell polarization [13, 14]. In 
addition to activating downstream signaling, synthetic recruitment 
strategies have also been used for inhibitory roles by sequestering 
protein regulators away from their signaling niches [15]. The 
increasing diversity of synthetic strategies for protein recruitment 
and control has immense potential for elucidating complex signal- 
ing networks. However, these systems are built from biochemical 
components bound by the laws of chemistry and physics. Often 
their behaviors in real cells do not match the cartoon models that 
we draw based on their design, and system responses can be variable 
from cell to cell and from day to day. Computational methods 
provide a natural complement for these approaches by assessing 
how component features can be tuned to elicit desired dynamics. 
When component parameters are known or can be empirically 
estimated, mathematical models can become powerful tools that 
offer predictability and insights into how biochemical and physical 
constraints affect system performance. They can be particularly 
useful for characterizing the kinetics and spatial patterns of compo- 
nent outputs after compartmentalized recruitment. 

Here, we describe a stepwise approach to construct and apply 
mathematical models that characterize the kinetics and spatial dis- 
tribution of protein recruitment [7]. We describe the construction 
of a system of ordinary differential equations (ODEs) to analyze the 
temporal dynamics of protein recruitment and partial differential 
equations (PDEs) that incorporate spatial patterns. Such 
ODE/PDE models have been useful in profiling
membrane-associated processes including EGF receptor-mediated 
MAPK signaling [16], optogenetic membrane anchoring [7], sig- 
nal transmission from compartmentalized Ras GTPase nanoclusters 
[17], and membrane-associated Rho GTPase cycling [18-20]. To 
illustrate this approach, we specifically focus on a two-component 
ODE model that encompasses the dynamics of local recruitment of 
a protein species to the plasma membrane. We also derive an 
associated PDE model that incorporates spatial conditions, symme- 
try features, and the effect of diffusion on spatial association pat- 
terns. We describe how to compute these models using MATLAB; 
however, similar computational strategies can be implemented 
using other programming languages. 




2 Materials

2.1 Personal Computer

A programming platform such as MATLAB equipped with algorithmic
solvers for systems of ODEs and/or 1D PDEs.

3 Methods

3.1 Modeling Recruitment Kinetics and Endpoint Dynamics


A key challenge in designing synthetic recruitment systems is 
achieving high levels of stimulus-induced responses with low basal 
recruitment. In general, the rate of membrane recruitment and the 
rate of dissociation are critical parameters that need to be optimized 
for this goal. We recently generated models to explore these fea- 
tures for a popular optogenetic approach, the improved light- 
induced dimerization (iLID) system [7]. iLID is an engineered 
protein containing a modified LOV2 domain that, in response to 
light, exposes a peptide from E. coli SsrA capable of binding with
high affinity to a partner SspB fusion protein (Fig. 2a) [6]. By 
anchoring iLID to the plasma membrane, this system can be used 
to concentrate target proteins of interest to local membrane sites in 
response to blue light exposure. In addition to intrinsic features 
that control component binding, including component conforma- 
tion dynamics and binding specifications of the SsrA peptide and 
SspB, recruitment performance of iLID depends on extrinsic vari- 
ables, such as component concentrations and compartmental 
anchoring, that often require empirical optimization by the user. 
However, in silico approaches can be useful for identifying para- 
meters that help guide recruitment optimization. 

To predict the behavior of such a system, we can construct 
ODE models to identify expression regimes of component species 
for which membrane recruitment is specific and efficient. Our 
simple model contains two protein species where a substrate (S) 
concentrates to the plasma membrane upon activation of the 
recruiting receptor species (R) (Fig. 2b). We consider the [iLID]
and [SspB] components as representations of [R] and [S], respectively,
but a structurally identical model can also be used to describe
other optogenetic approaches or, alternatively, simple systems of
localized protein recruitment. In this model, R is bistable; it can
exist either in an inactive state with low affinity for substrate S or in a
high-affinity active state, R* (Fig. 2b). It is critical to consider
binding for both states of R, as the basal binding of S to inactive
R can be a key limitation of recruitment systems at high component
concentrations.

Here, we describe how to outline the primary states of the 
system and build an ODE model that captures the kinetics of 
protein recruitment. After defining component species, the inter- 
action states, reaction events, and initial conditions prior to recep- 
tor activation can be determined: 




[Figure 2 image. Panel titles: (A) Receptor Mediated Membrane Recruitment (protein, substrate, light activation); (B) Receptor-Substrate Interaction States]

Fig. 2 Schematic diagrams of iLID recruitment and component interaction states. (a) Diagram illustrating 
idealized iLID and SspB interactions before and after light activation. (b) Schematic diagram depicting possible 
activation states and interaction events during membrane recruitment of substrate S by receptor R, including
“dark state binding” in which the substrate binds to an inactive membrane receptor 


1. Define the molecular components of the system, and systemat- 
ically determine each state of the system and each possible 
transition between states. For our model, this corresponds to 
the diagram in Fig. 2b. 


2. Assign variables to the protein species and component states 
involved in recruitment. Be sure to have a variable for every 
molecular species in the model: 




Protein Species:
R = Free inactive receptor
R* = Free active receptor
S = Free substrate
RS = Substrate-bound inactive receptor
R*S = Substrate-bound active receptor


3. Write chemical equations for each reaction in the model, 
including component interactions and transitions between pro- 
tein states. We assume that receptor activation occurs in 
response to the experimental stimulation with a single rate 
that is equal for all binding states of the receptor. For this 
example, the receptor activation rate, $\gamma$, will be nonzero during
stimulation and zero otherwise: 


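Interaction Events
$R + S \rightleftharpoons RS$;
Forward Rate = $\mathrm{Rate}_{\mathrm{Inactive,Binding}}$;
Reverse Rate = $\mathrm{Rate}_{\mathrm{Inactive,Release}}$
$R^* + S \rightleftharpoons R^*S$;
Forward Rate = $\mathrm{Rate}_{\mathrm{Active,Binding}}$;
Reverse Rate = $\mathrm{Rate}_{\mathrm{Active,Release}}$
Receptor Activation Events
$R \rightleftharpoons R^*$;
Forward Rate = $\gamma$;
Reverse Rate = $\mathrm{Rate}_{\mathrm{Rev}}$
$RS \rightleftharpoons R^*S$;
Forward Rate = $\gamma$;
Reverse Rate = $\mathrm{Rate}_{\mathrm{Rev}}$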


4. Define reaction constants. We can define the rate constants
numerically using estimates based on published measurements.
In many cases, the binding affinities ($K_d$) may be available, but
the forward and reverse binding rates may not be. In this case,
we relate both rates to the $K_d$ and estimate the off-rate using
published kinetic data or by calibrating the model to empirical
measurements (see Note 1):

\[ K_{d,\mathrm{inactive}} = \frac{\mathrm{Rate}_{\mathrm{Inactive,Release}}}{\mathrm{Rate}_{\mathrm{Inactive,Binding}}} \]




We will treat the forward rate of receptor activation ($\gamma$) as
an experimental input into the model, but the associated 
reverse rate can likely be empirically obtained or estimated 
from observations made in prior literature. 


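5. Define mathematical versions of the rate equations for each
species. Construct one differential equation for each species
in the model (see Note 2). To simulate receptor activation,
the receptor activation term, $\gamma_{\mathrm{input}}$, will depend on the external
input at a given time. For iLID systems, $\gamma_{\mathrm{input}}$ is the temporal
profile of blue light irradiation:

\[ \frac{d[R^*]}{dt} = \gamma_{\mathrm{input}}[R] + \mathrm{Rate}_{\mathrm{Active,Release}}[R^*S] - \mathrm{Rate}_{\mathrm{Active,Binding}}[R^*][S] - \mathrm{Rate}_{\mathrm{Rev}}[R^*] \tag{1} \]

\[ \frac{d[R^*S]}{dt} = \gamma_{\mathrm{input}}[RS] + \mathrm{Rate}_{\mathrm{Active,Binding}}[R^*][S] - \mathrm{Rate}_{\mathrm{Active,Release}}[R^*S] - \mathrm{Rate}_{\mathrm{Rev}}[R^*S] \tag{2} \]

\[ \frac{d[R]}{dt} = \mathrm{Rate}_{\mathrm{Inactive,Release}}[RS] + \mathrm{Rate}_{\mathrm{Rev}}[R^*] - \mathrm{Rate}_{\mathrm{Inactive,Binding}}[R][S] - \gamma_{\mathrm{input}}[R] \tag{3} \]

\[ \frac{d[RS]}{dt} = \mathrm{Rate}_{\mathrm{Inactive,Binding}}[R][S] + \mathrm{Rate}_{\mathrm{Rev}}[R^*S] - \mathrm{Rate}_{\mathrm{Inactive,Release}}[RS] - \gamma_{\mathrm{input}}[RS] \tag{4} \]

\[ \frac{d[S]}{dt} = \mathrm{Rate}_{\mathrm{Inactive,Release}}[RS] + \mathrm{Rate}_{\mathrm{Active,Release}}[R^*S] - \mathrm{Rate}_{\mathrm{Inactive,Binding}}[R][S] - \mathrm{Rate}_{\mathrm{Active,Binding}}[R^*][S] \tag{5} \]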


6. Define the initial state of the system. Prior to stimulation, we
assume that the receptor is entirely inactive and the system is in
steady state. Therefore, each reaction species can be defined
using known measurements. At this initial state, variables
$[R]_{\mathrm{total}}$ and $[S]_{\mathrm{total}}$ are defined to be constants representing
the total concentration of the two proteins. Additionally,
conservation of mass can be used to relate free $[R]$ and $[S]$ to the
amount of complexed $[RS]$:

\[ [R^*] = 0; \]
\[ [R^*S] = 0; \]
\[ [R] = [R]_{\mathrm{total}} - [RS]; \]
\[ [S] = [S]_{\mathrm{total}} - [RS]; \]




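7. Compute the concentration of RS at steady state by setting the
rate $d[RS]/dt$ (Eq. 6) to zero:

\[ \frac{d[RS]}{dt} = \mathrm{Rate}_{\mathrm{Inactive,Binding}}[R][S] - \mathrm{Rate}_{\mathrm{Inactive,Release}}[RS] \tag{6} \]

\[ 0 = \mathrm{Rate}_{\mathrm{Inactive,Binding}}[R][S] - \mathrm{Rate}_{\mathrm{Inactive,Release}}[RS] \quad \text{(at steady state)} \tag{7} \]

\[ 0 = ([R]_{\mathrm{total}} - [RS])([S]_{\mathrm{total}} - [RS]) - K_d[RS] \tag{8} \]

\[ 0 = [RS]^2 + [R]_{\mathrm{total}}[S]_{\mathrm{total}} - [RS][R]_{\mathrm{total}} - [RS][S]_{\mathrm{total}} - K_d[RS] \tag{9} \]

\[ 0 = [RS]^2 - ([R]_{\mathrm{total}} + [S]_{\mathrm{total}} + K_d)[RS] + [R]_{\mathrm{total}}[S]_{\mathrm{total}} \tag{10} \]

\[ [RS] = \frac{([R]_{\mathrm{total}} + [S]_{\mathrm{total}} + K_d) - \sqrt{([R]_{\mathrm{total}} + [S]_{\mathrm{total}} + K_d)^2 - 4[R]_{\mathrm{total}}[S]_{\mathrm{total}}}}{2} \tag{11} \]

\[ [RS] = \frac{b - \sqrt{b^2 - 4[R]_{\mathrm{total}}[S]_{\mathrm{total}}}}{2}, \quad b = [R]_{\mathrm{total}} + [S]_{\mathrm{total}} + K_d \tag{11'} \]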


8. Now that rate equations for each reaction species and initial 
conditions are defined, reaction kinetics can be computed and 
the ODEs solved algorithmically. We have customarily used the 
MATLAB ODE solver function, ode45, for this; however, simi- 
lar algorithmic solvers of ODEs can be found in many other 
programming languages (see Note 3). 
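The steps above can be sketched in a few lines of code. The following is a minimal illustration in Python rather than the MATLAB workflow used in this chapter: it encodes the right-hand sides of Eqs. 1-5 and the basal steady state of Eq. 11, and then checks that the initial state is stationary when the light input is off. The function names, the dictionary keys, and all numerical rate constants are placeholders chosen for illustration, not measured iLID parameters.

```python
import math


def basal_bound(r_total, s_total, kd):
    """Basal [RS] from Eq. 11 (root of Eq. 10 that keeps [RS] <= totals)."""
    b = r_total + s_total + kd
    return (b - math.sqrt(b * b - 4.0 * r_total * s_total)) / 2.0


def initial_state(r_total, s_total, k):
    """State (R, R*, RS, R*S, S) before stimulation (step 6)."""
    kd_inactive = k["inactive_release"] / k["inactive_binding"]
    rs = basal_bound(r_total, s_total, kd_inactive)
    return (r_total - rs, 0.0, rs, 0.0, s_total - rs)


def rhs(state, gamma_input, k):
    """Time derivatives of (R, R*, RS, R*S, S), following Eqs. 1-5."""
    r, ra, rs, ras, s = state
    bind_i = k["inactive_binding"] * r * s
    rel_i = k["inactive_release"] * rs
    bind_a = k["active_binding"] * ra * s
    rel_a = k["active_release"] * ras
    d_ra = gamma_input * r + rel_a - bind_a - k["rev"] * ra      # Eq. 1
    d_ras = gamma_input * rs + bind_a - rel_a - k["rev"] * ras   # Eq. 2
    d_r = rel_i + k["rev"] * ra - bind_i - gamma_input * r       # Eq. 3
    d_rs = bind_i + k["rev"] * ras - rel_i - gamma_input * rs    # Eq. 4
    d_s = rel_i + rel_a - bind_i - bind_a                        # Eq. 5
    return (d_r, d_ra, d_rs, d_ras, d_s)


# Placeholder rate constants (units of uM and s), for illustration only.
rates = {
    "inactive_binding": 1.0, "inactive_release": 5.0,   # K_d,inactive = 5 uM
    "active_binding": 1.0, "active_release": 0.1,
    "rev": 0.05,
}
y0 = initial_state(1.0, 1.0, rates)
print("basal [RS]:", y0[2])
print("derivatives in the dark:", rhs(y0, 0.0, rates))
```

A function with the signature of `rhs` is what a general-purpose ODE solver expects; printing the derivatives at the initial state is a quick way to catch sign errors before running longer simulations.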


3.2 Analyzing Efficiency of Recruitment

This approach can be customized to simulate kinetic dynamics of a
variety of synthetic recruitment strategies by using different choices
of component concentrations, dissociation constants, and
activation/inactivation rates of the receptor (see Note 4). Here, we will
briefly display example model performance measures using values
determined for iLID-mediated recruitment. As a general note, we
suggest simulating the model for a few choices of concentrations
first to verify that the output looks reasonable and to gain an
intuition for the model.
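As one way to carry out such a trial run, the sketch below (Python, standard library only; not the authors' MATLAB code) integrates Eqs. 1-5 with a fixed-step fourth-order Runge-Kutta scheme for a transient light pulse and reads out basal recruitment, maximum recruitment, fold recruitment, and the half-time of dissociation. The rate constants, pulse timing, and step size are assumed placeholder values. Fold recruitment is taken here as peak membrane-recruited substrate ([RS] + [R*S]) divided by its basal value, and the half-time as the time after light-off at which the excess over basal has halved; both definitions are assumptions made for this illustration.

```python
import math

# Placeholder rate constants (uM, s); illustrative only, not fitted iLID values.
K = {"ib": 1.0, "ir": 5.0, "ab": 1.0, "ar": 0.1, "rev": 0.05}


def rhs(y, gamma):
    """Eqs. 1-5 with y = (R, R*, RS, R*S, S); returns derivatives in that order."""
    r, ra, rs, ras, s = y
    bi, ri = K["ib"] * r * s, K["ir"] * rs
    ba, rel = K["ab"] * ra * s, K["ar"] * ras
    return (
        ri + K["rev"] * ra - bi - gamma * r,
        gamma * r + rel - ba - K["rev"] * ra,
        bi + K["rev"] * ras - ri - gamma * rs,
        gamma * rs + ba - rel - K["rev"] * ras,
        ri + rel - bi - ba,
    )


def rk4_step(y, gamma, dt):
    """One classical Runge-Kutta step with the input held constant."""
    def shifted(k, h):
        return tuple(a + h * b for a, b in zip(y, k))
    k1 = rhs(y, gamma)
    k2 = rhs(shifted(k1, dt / 2), gamma)
    k3 = rhs(shifted(k2, dt / 2), gamma)
    k4 = rhs(shifted(k3, dt), gamma)
    return tuple(a + dt / 6 * (p + 2 * q + 2 * u + v)
                 for a, p, q, u, v in zip(y, k1, k2, k3, k4))


def simulate(r_total, s_total, t_on=10.0, t_off=40.0, t_end=300.0,
             gamma_on=1.0, dt=0.02):
    """Return times and membrane-recruited S ([RS] + [R*S]) for a light pulse."""
    b = r_total + s_total + K["ir"] / K["ib"]
    rs0 = (b - math.sqrt(b * b - 4 * r_total * s_total)) / 2   # Eq. 11
    y = (r_total - rs0, 0.0, rs0, 0.0, s_total - rs0)
    times, bound = [0.0], [rs0]
    for i in range(int(round(t_end / dt))):
        gamma = gamma_on if t_on <= i * dt < t_off else 0.0
        y = rk4_step(y, gamma, dt)
        times.append((i + 1) * dt)
        bound.append(y[2] + y[3])
    return times, bound


def metrics(times, bound, t_off=40.0):
    """Basal, max, fold recruitment and half-time of dissociation after t_off."""
    basal, peak = bound[0], max(bound)
    i_off = min(range(len(times)), key=lambda i: abs(times[i] - t_off))
    half = basal + 0.5 * (bound[i_off] - basal)
    t_half = next((t - t_off for t, m in zip(times[i_off:], bound[i_off:])
                   if m <= half), None)
    return {"basal": basal, "max": peak, "fold": peak / basal, "t_half": t_half}


for r_total in (0.5, 5.0):
    t, m = simulate(r_total, 1.0)
    print("[R] total =", r_total, metrics(t, m))
```

Looping the same two calls over a grid of total receptor and substrate concentrations yields the values that can be displayed as heat maps of each performance measure.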


1. The computed system of ODEs can be evaluated by analyzing 
notable features of model outputs. For example, basal recruit- 
ment, maximum recruitment, fold recruitment, and dissocia- 
tion each designate performance measures that help guide 
optimization strategies for efficient synthetic recruitment 
(Fig. 3). Basal recruitment can be computed from the initial 
steady state of the model, while other measures can be 




[Figure 3 image. Axes: Membrane Recruited [S] ([RS] + [R*S]) against Time. Legend: a = basal recruitment; b = max recruitment; (a+b)/a = fold enrichment; c = t1/2 for dissociation]


Fig. 3 Representative plot of recruitment kinetics after receptor activation. 
Depiction of an example kinetic profile of recruitment after temporary receptor 
stimulation. Illustrated here are measures of recruitment dynamics captured in 
ODE/PDE models including basal recruitment, max recruitment, fold enrichment, 
and $t_{1/2}$ of dissociation


determined from the simulated model output (Fig. 4a). Max 
absolute recruitment can be determined from the steady-state 
solution during extended system input, similar to how initial 
conditions were calculated in the previous section (Fig. 4d). 
Fold recruitment can be calculated using computed values for 
basal and max recruitment (Fig. 4g). Kinetic parameters such as 
$t_{1/2}$ of dissociation are computed from temporal profiles for
simulations involving transient system input (Fig. 4j). After 
analyzing the model across systematic ranges of concentrations 
for each component, heat map plots can be used to visualize 
how each performance measure depends on component con- 
centrations (Fig. 4b, e, h, k). Each pixel in the heat map 
represents outputs from model simulations using specific combinations
of parameters (Fig. 4c, f, i, l; top) (see Note 5). Lastly,
analogous plots can also be generated to visualize how perfor- 
mance depends on additional variables such as rate constants 
and component binding affinity characteristics. 


2. Model output interpretation: Ideally, performance measures 
generated from model simulations can generally predict how 
recruitment parameters and component characteristics influ- 
ence recruitment efficiency. For example, our system of ODEs 
generally predicts that both basal and maximum membrane 
recruitments scale positively with increasing component con- 
centrations (Fig. 4c, f, bottom). While this trade-off limits 
system performance, the simulation results can be used to 
understand how fold recruitment scales with component concentrations
(Fig. 4i, bottom) and identify ranges of component
concentrations where system performance is most efficient. By 




[Figure 4 image. Rows: (A-C) basal recruitment (µM); (D-F) max recruitment, max recruited [S] (µM); (G-I) fold enrichment, ([R*S] + [RS])/[RS], with a = basal recruitment, b = max recruitment, and (a+b)/a = fold enrichment; (J-L) t1/2 of dissociation (sec). Kinetic traces plot membrane-recruited [S] against time; heat maps plot [R] (µM) against [S] (µM) from 0.01 to 10.0 µM; line traces compare [R] = 0.5 µM and [R] = 5 µM]

Fig. 4 ODE modeling captures important features of recruitment dynamics across broad ranges of component 
expression regimes. (a, d, g, j) Example plot profiles of recruitment kinetics each illustrating a measurement 
feature captured by ODE modeling. (b, e, h, k) Heat map plots generated from ODE models displaying 
individual features of recruitment across four orders of magnitude of [R] and [S] concentrations. In these
examples, values represent real recruitment measures predicted by ODE models constructed for iLID-SspB
interactions. Red bars designate isolated concentration regions portrayed in c, f, i, and l. (c, f, i, l) Top, heat
map insets of specific regions extracted from b, e, h, and k (red bars) showing differential dynamics between 
two different receptor concentrations. Bottom, line trace format of heat map insets comparing recruitment 
features at two different receptor concentrations 


defining threshold values for parameters of interest, an ideal
concentration space can be determined where system performance
exceeds each threshold. These results can be used to
guide experimental design and troubleshoot parameters where
synthetic recruitment is not performing as desired. Furthermore,
the model can make less intuitive predictions. For example,
our iLID ODE model predicted that global iLID-SspB
dissociation rates decrease with increasing iLID concentration
[7] (Fig. 4l, bottom). This effect arises from newly dissociated
SspB molecules being more likely to rebind at the
membrane if surrounding levels of unbound iLID are high.


3.3 Modeling Spatial Dynamics of Recruitment


Models generated from ODEs typically capture dynamics across a 
single dimension and are therefore suitable for determining the 
temporal evolution of reaction systems. However, in addition to 
kinetic features, cell signaling mechanisms often rely on spatially 
heterogeneous patterns. For protein interactions at the plasma 
membrane, cytoplasmic diffusion near the cell cortex and lateral 
diffusion along the membrane are important factors that influence 
spatial distributions of recruitment events. Additionally, functional 
outputs of biological signaling are often determined by spatially 
asymmetric propagation of signaling circuitry. Partial differential 
equations (PDEs) follow system dynamics across multiple indepen- 
dent variables and are useful for capturing how system components 
change in space and time. Here, we will generate a PDE model that 
illustrates how diffusion affects the spatial distribution of recruit- 
ment over time. Toward this goal, consider the general diffusion 
equation for temporal change in concentration of a chemical spe- 
cies over a one-dimensional spatial coordinate: 


\[ \frac{\partial u}{\partial t} = D\,\frac{\partial^{2} u}{\partial x^{2}} \tag{12} \]

where $u(x, t)$ represents a concentration value of species $u$ at position
$x$ and time $t$. Additionally, $\partial u/\partial t$ represents the change of concentration
of $u$ over time, $\partial^{2} u/\partial x^{2}$ describes the curvature of the concentration
profile along the spatial coordinate $x$, and $D$ is the diffusion coefficient
within the system.
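To make Eq. 12 concrete, the following is a minimal sketch, assuming an explicit finite-difference discretization on a uniform grid with no flux through either end of the domain. It is written in Python for illustration and is not the solver used in this chapter; the grid spacing, time step, and diffusion coefficient are placeholder values, chosen so that D * dt / dx^2 stays below 0.5 as the explicit scheme requires for stability.

```python
def diffuse_step(u, diffusion, dx, dt):
    """Advance Eq. 12 by one explicit time step with zero-flux ends."""
    n = len(u)
    new = [0.0] * n
    for i in range(n):
        left = u[i - 1] if i > 0 else u[i]        # mirrored value: no flux at x = 0
        right = u[i + 1] if i < n - 1 else u[i]   # mirrored value: no flux at x_max
        new[i] = u[i] + diffusion * dt / dx ** 2 * (left - 2.0 * u[i] + right)
    return new


# Placeholder set-up: 41 grid points, 0.5 um apart, D = 1 um^2/s, dt = 0.05 s.
profile = [0.0] * 41
profile[20] = 1.0                      # all material starts in the central cell
for _ in range(200):                   # 200 steps = 10 s of simulated time
    profile = diffuse_step(profile, diffusion=1.0, dx=0.5, dt=0.05)

print("peak after 10 s:", max(profile))
print("total (conserved):", sum(profile))
```

The peak flattens while the total stays constant, which is the behavior the diffusion term contributes to each species once reaction terms are added.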

We can build our PDE model based on our previous ODE 
model through the following steps: 


1. Determine the spatial domain for the model. Using a 
one-dimensional domain simplifies computation and the 
interpretation of model results. We can modify symmetry to 
model higher-dimensional geometries using this simple 
domain. For analyzing the spatial spread of proteins on a 
two-dimensional plasma membrane, we define our spatial coor- 
dinate to be the radial distance along the membrane from the 




target recruitment site. We use symmetry conditions in the
model to handle the increasing area associated with larger
radial distances. This is accomplished in MATLAB’s pdepe
PDE solver by setting the symmetry parameter m to 1 for
“cylindrical” symmetry (see Note 6).


2. Define the diffusion coefficients. The diffusion coefficients are
the only additional parameters for this model (see Note 7).

3. Define the molecular species within the variable $u$.
Importantly, $u(x, t)$ can represent a single concentration species
across the spatial coordinate $x$ and time coordinate $t$ or, for
signaling circuits that involve multiple reaction species, a matrix
that incorporates all relevant species within the system:


\[ u(x,t) = \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \\ u_5 \end{bmatrix} \tag{13} \]

where $u_1 = [R]$, $u_2 = [R^*]$, $u_3 = [RS]$, $u_4 = [R^*S]$, $u_5 = [S]$.

For our purposes, we interpret the units for the spatial
dimension to be in microns since this is a relevant scale for
a cell.


4. To build a PDE, spatial boundary conditions must also be
defined. We typically use the Neumann condition (see Note
8), specifying that the spatial derivative is zero at the
boundaries:

\[ \frac{\partial u}{\partial x} = 0 \quad \text{at } x = 0 \text{ and } x = x_{\max} \]


5. In contrast to the ODE example, we will now assume that this
system begins with a pre-established profile of active receptors.
This approach is useful for analyzing the spread of recruited
components after an initial standardized input. Therefore,
initial conditions can be adapted as follows:

\[ \begin{aligned}
u_1 &= ([R]_{\mathrm{total}} - BB) - u_2; \\
u_2 &= ([R]_{\mathrm{total}} - BB)\, r(x); \\
u_3 &= BB - u_4; \\
u_4 &= BB\, r(x); \\
u_5 &= [S]_{\mathrm{total}} - u_3 - u_4;
\end{aligned} \]

\[ \text{where } BB = \frac{b - \sqrt{b^2 - 4[R]_{\mathrm{total}}[S]_{\mathrm{total}}}}{2}, \quad b = [R]_{\mathrm{total}} + [S]_{\mathrm{total}} + K_d \]

where $r(x)$ is a function that determines the spatial distribution
of receptor activation and $BB$ represents the basal bound state
of receptor. For example, in optogenetic systems such as iLID,
after global light activation, $r(x)$ can be set to a Gaussian profile
peaking at $x = 0$, with a width determined by the resolution of
focal stimulation for an optical microscope system.


6. For algorithmic evaluation of PDE models, programmatic
solvers often require representing PDEs in standard organizational
forms. In MATLAB, a standard form for 1D PDE
solvers is:

\[ c\left(x,t,u,\frac{\partial u}{\partial x}\right)\frac{\partial u}{\partial t} = x^{-m}\frac{\partial}{\partial x}\left(x^{m} f\left(x,t,u,\frac{\partial u}{\partial x}\right)\right) + s\left(x,t,u,\frac{\partial u}{\partial x}\right) \tag{14} \]

where $f(x, t, u, \partial u/\partial x)$ is a term for the spatial flux of the species $u$,
$s(x, t, u, \partial u/\partial x)$ is a source term or reaction term that, in this case,
will incorporate binding and chemical reactions that generate
or deplete species within $u$, and $c(x, t, u, \partial u/\partial x)$ represents a
balance coefficient. $m$ is the symmetry constant that determines
the type of spatial symmetry in the system; $m$ = 0, 1, or
2 represents Cartesian (no symmetry), cylindrical symmetry
(azimuthal), or spherical symmetry (azimuthal and zenith)
coordinates, respectively.


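7. Referring to the initial equation for diffusion (Eq. 12), with the
addition of the reaction term, its standard form can be
rewritten as:

\[ \frac{\partial u}{\partial t} = \frac{\partial}{\partial x}\left(D\,\frac{\partial u}{\partial x}\right) + s\left(u(x,t), \frac{\partial u}{\partial x}\right) \tag{15} \]

where, by comparison with Eq. 14, $c = 1$ and $f = D\,\partial u/\partial x$.

In this system, the source term $s$ corresponds to the same
set of terms as the right-hand side of the equations from our
ODEs generated previously (Eqs. 1, 2, 3, 4, and 5).
Collectively, these equations can be written in matrix form as
$s(u(x, t), \partial u/\partial x)$ (see Note 9):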




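\[ \begin{bmatrix} s_1 \\ s_2 \\ s_3 \\ s_4 \\ s_5 \end{bmatrix} =
\begin{bmatrix}
\mathrm{Rate}_{\mathrm{Inactive,Release}}\, u_3 + \mathrm{Rate}_{\mathrm{Rev}}\, u_2 - \mathrm{Rate}_{\mathrm{Inactive,Binding}}\, u_1 u_5 \\
\mathrm{Rate}_{\mathrm{Active,Release}}\, u_4 - \mathrm{Rate}_{\mathrm{Active,Binding}}\, u_2 u_5 - \mathrm{Rate}_{\mathrm{Rev}}\, u_2 \\
\mathrm{Rate}_{\mathrm{Inactive,Binding}}\, u_1 u_5 + \mathrm{Rate}_{\mathrm{Rev}}\, u_4 - \mathrm{Rate}_{\mathrm{Inactive,Release}}\, u_3 \\
\mathrm{Rate}_{\mathrm{Active,Binding}}\, u_2 u_5 - \mathrm{Rate}_{\mathrm{Active,Release}}\, u_4 - \mathrm{Rate}_{\mathrm{Rev}}\, u_4 \\
\mathrm{Rate}_{\mathrm{Inactive,Release}}\, u_3 + \mathrm{Rate}_{\mathrm{Active,Release}}\, u_4 - \mathrm{Rate}_{\mathrm{Inactive,Binding}}\, u_1 u_5 - \mathrm{Rate}_{\mathrm{Active,Binding}}\, u_2 u_5
\end{bmatrix} \]


8. At this stage, with the initial and boundary conditions set, the
PDEs can be integrated using programmatic solvers such as the
pdepe function in MATLAB (see Note 10).


3.4 Analyzing Recruitment and Diffusional Spread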
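For readers working outside MATLAB, the PDE set-up from Subheading 3.3 can be sketched as follows. This is a simplified stand-in for a dedicated PDE solver, not the authors' implementation: it uses a radial finite-volume grid (the discrete analogue of m = 1), zero flux at both boundaries, the Gaussian initial profile r(x), the reaction terms s1 to s5, and explicit Euler time stepping. All numerical values (rate constants, diffusion coefficients, grid, time step, Gaussian width) are assumed placeholders, and the readout, recruited substrate in the innermost cell over time, is only one of several possible measures.

```python
import math

# Placeholder parameters (uM, um, s); illustrative only, not fitted iLID values.
K = {"ib": 1.0, "ir": 5.0, "ab": 1.0, "ar": 0.1, "rev": 0.05}
D = (0.1, 0.1, 0.1, 0.1, 5.0)   # diffusion of R, R*, RS, R*S and S (um^2/s)
N, DX, DT = 40, 0.5, 0.005      # 40 radial cells of 0.5 um; time step in s


def source(u1, u2, u3, u4, u5):
    """Reaction terms s1..s5 for one grid cell."""
    bi, ri = K["ib"] * u1 * u5, K["ir"] * u3
    ba, ra = K["ab"] * u2 * u5, K["ar"] * u4
    return (ri + K["rev"] * u2 - bi,
            ra - ba - K["rev"] * u2,
            bi + K["rev"] * u4 - ri,
            ba - ra - K["rev"] * u4,
            ri + ra - bi - ba)


def radial_diffusion(u, d):
    """Finite-volume form of x^-1 d/dx (x D du/dx), zero flux at both ends."""
    flux = [0.0] + [d * (u[j] - u[j - 1]) / DX for j in range(1, N)] + [0.0]
    return [2.0 * ((i + 1) * flux[i + 1] - i * flux[i]) / ((2 * i + 1) * DX)
            for i in range(N)]


def initial_state(r_total, s_total, sigma=2.0):
    """Initial profiles u1..u5 with a Gaussian r(x) of assumed width sigma."""
    b = r_total + s_total + K["ir"] / K["ib"]
    bb = (b - math.sqrt(b * b - 4.0 * r_total * s_total)) / 2.0
    r = [math.exp(-((i + 0.5) * DX) ** 2 / (2.0 * sigma ** 2)) for i in range(N)]
    u2 = [(r_total - bb) * ri for ri in r]
    u4 = [bb * ri for ri in r]
    u1 = [(r_total - bb) - v for v in u2]
    u3 = [bb - v for v in u4]
    u5 = [s_total - a - c for a, c in zip(u3, u4)]
    return [u1, u2, u3, u4, u5]


def step(u):
    """One explicit Euler step of diffusion plus reaction for all species."""
    diff = [radial_diffusion(u[k], D[k]) for k in range(5)]
    new = [[0.0] * N for _ in range(5)]
    for i in range(N):
        s = source(*(u[k][i] for k in range(5)))
        for k in range(5):
            new[k][i] = u[k][i] + DT * (diff[k][i] + s[k])
    return new


def run(r_total=1.0, s_total=1.0, t_end=5.0):
    """Integrate and record recruited S ([RS] + [R*S]) in the innermost cell."""
    u = initial_state(r_total, s_total)
    centre = [u[2][0] + u[3][0]]
    for _ in range(int(round(t_end / DT))):
        u = step(u)
        centre.append(u[2][0] + u[3][0])
    return u, centre


u_final, centre = run()
peak = max(centre)
print("basal:", centre[0], "max:", peak, "time to max (s):", centre.index(peak) * DT)
```

An explicit scheme like this needs a small time step (here DT * 2 * D / DX^2 is kept well below 1), which is one reason a dedicated stiff PDE solver is usually preferable for production runs.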


1. Just as with the ODE model, measures of system performance
should be computed. While basal recruitment can be computed
similarly, maximal recruitment will be different since PDE 
models simulate recruitment in a local subregion of the plasma 
membrane that evolves over time (Fig. 5a). In this case, both 
dissociation and diffusional spread can reduce the local accu- 
mulation of the recruited protein. For this reason, maximal 
recruitment should be determined by following simulation 
outputs over time. Additionally, the dependence of recruit- 
ment on the concentrations of R and S will likely scale differ- 
ently from what was previously produced in the ODE model. 


2. Compared to ODE simulations, the PDE model naturally presents
additional measures of system performance. Most notably,
the spread of recruitment regions can be calculated from
the spatial recruitment profiles at given component concentra- 
tions and diffusion characteristics (Fig. 5a). 


3. As with the ODE model, thresholds for the PDE system can be
set for each measure of system performance. Regions in component
concentration spaces that exhibit acceptable recruitment
performances can be identified and subsequently used
to inform how synthetic systems can be optimized at the bench 
(Fig. 5b, d, f). For example, we have used similarly structured 
PDE models to optimize iLID recruitment approaches. These 
PDE models profiled how customizing plasma membrane 
anchoring strategies that confer differential membrane diffu- 
sion properties to iLID receptors influence spatiotemporal 
SspB recruitment (see Notes 11 and 12). PDE modeling of 
iLID recruitment offered interesting predictions. Constraining 
receptor diffusion at the membrane by increasing membrane 
anchor size resulted in significant changes in substrate recruit- 
ment levels. This model predicted that decreasing receptor 
diffusion promoted increased maximum recruitment, fold 
recruitment, and lengthened evolution time to maximal 


Modelling Protein Recruitment Dynamics 203 


re to Max Recruitment 


Nsec. «t= 15sec t=3 t=3 sec t= 4.5 sec t=6sec t=7.5 sec 
= = D=0.1 pm’/sec 
= = D=1 pm*/sec 
= 
© 
& 
2 Max Recruitment 
ind : Basal Recruitment 
0 20 
Sn istance (um) Recruitment ——— 


B) 


D = 0.1pm?2/sec 


Oo 
Max Recruitment (uM) 


Max Recruitment 


0.001 


0.01 0.1 1.0 10.0 


= 1 um?/sec 


SI uM 


D) E 


— 


D = 0.1 pm?2/sec 


T™* Recruitment (sec) 


3 4 
[R] (uM) 


F) 


= 


S 
¢ 
ct) 
E 

x 

3 
= 
i= 

Lu 

z 
[) 

LL 


Fold Enrichment 
([RS] + [R*S])/[RS] 


Fig. 5 PDE modeling captures the effects of receptor diffusion on spatial spread and recruitment dynamics 
across broad ranges of component expression levels. (a) Time-lapse plots displaying spatial spread of 
recruitment predicted by PDE modeling. Additionally, important measures of recruitment are depicted 
including max recruitment, basal recruitment, time to max recruitment, and recruitment spread. (b, d, f) 
Heat map plots generated from PDE models displaying the effect of receptor diffusion on individual features of 
recruitment across ranges of [A] and [S] concentrations. In these examples, values represent real recruitment 
measures predicted by PDE simulations of iLID-SspB interactions with two different membrane anchors (see 
Note 10). Red bars designate isolated concentration regions portrayed in c, e, and g. (Cc, e, g) Line traces of 
heat map insets (red bars) from b, d, and f comparing the effect of changing receptor diffusion on individual 
recruitment features across a range of receptor concentrations 


204 Kwabena A. Badu-Nkansah et al. 


4 Notes 


recruitment of substrate across wide ranges of receptor and 
substrate concentrations (Fig. 5c, e, g). 


Altogether, two-component ODE and PDE models reliably 
capture fundamental features of protein recruitment dynamics. 
With proper reaction constants and measures for component fea- 
tures in hand, modeling approaches like these can be implemented 
in a_ straightforward manner. Additionally, these analytical 
approaches offer powerful predictive strength for synthetic recruit- 
ment strategies and can provide unique insights into efficient 
manipulation of compartmentalized signaling using synthetic tools. 


1. To estimate the kinetic binding and dissociation rates, one can 
make some simplifying assumptions. In many cases, binding 
affinities are determined largely by the dissociation rates. For 
simplicity, we can assume that the association rates are equal for 
inactive and active forms of R. We then can estimate or cali- 
brate the association and dissociation rates from kinetic experi- 
ments measuring the half-time for association after a strong 
light stimulus. Importantly, association rates are likely to be 
different for different optogenetic systems. For example, the 
“magnets” system was designed to have a more rapid associa- 
tion rate [21, 22]. 

2. It is essential that as differential equations are built, balance is 
maintained according to the law of conservation of mass. This 
can be checked by making sure that for each equation, all events 
that either produce or consume the target component species 
are represented. 


3. In many cases, the light input term will be a step function.
Step functions are typically not handled well by numerical
integration algorithms such as ode45. A handy solution to this
problem is to perform piecewise numerical integration. One can
separately perform numerical integration for each time period
in which the light input is constant, using the output of each
round of numerical integration as the initial condition for the
next. For example, a simple experiment where the light input is
1 for the first round and then 0 thereafter would require two
separate rounds of numerical integration, with the second
starting from the conditions produced by the first round.


4. Accurate estimates of component concentrations, dissociation
constants, and reaction rates can improve the predictive ability
of ODE/PDE models. For modeling iLID recruitment, we use
an assortment of values either derived empirically or
approximated using measurements from similar mechanisms. The
following parameter values have been useful for modeling iLID
dynamics (with associated references):


Total [iLID] = 0.1 µM
Total [SspB] = 0.5 µM
$K_{d,\mathrm{lit}}$ (active) = 130 nM [6]
$K_{d,\mathrm{dark}}$ (inactive) = 4.7 µM [6]
$\mathrm{Rate}_{\mathrm{Reversal}}$ = 0.02 s$^{-1}$ [21]
$\mathrm{Rate}_{\mathrm{Dissociation}}$ = 0.5 s$^{-1}$ [21]


5. Model outputs may evolve over time; therefore, it is important
to verify that simulations are run over long enough time periods
to determine the correct value.

6. While cylindrical symmetry is useful for simulating spot
recruitment and diffusion along a flat membrane interface, in other
cases, it may be useful to model diffusion of a cytoplasmic
component toward or away from the membrane in a spherical
cell. For the latter, the symmetry parameter m, in MATLAB's
pdepe PDE solver, can be set to 2 for designating spherical
(azimuthal and zenith) symmetry coordinates.

7. Diffusion coefficients can be determined empirically, for
instance, through fluorescence recovery after photobleaching
experiments. As a rough guide, diffusion coefficients may be
around 10-30 µm²/s for cytoplasmic proteins, 0.5-1 µm²/s for
lipid-anchored proteins, and 0.03-0.1 µm²/s for transmembrane
proteins.

8. The Neumann boundary condition specifies the value of the
spatial derivative of the solution at the boundaries. By setting
the derivative to zero at each boundary, the resulting condition
can be thought of as a "reflecting" boundary which keeps
the flux of model components within the spatial barriers of the
system. Therefore, under this condition, there is no passage of
molecular species in or out of the system through the boundary,
which helps ensure conservation of mass.

9. Note that the source term s for the PDEs encompasses kinetic
parameters and interaction states of component species. To also
incorporate light input, s(x, t, u, ∂u/∂x) can include a light
input term that designates a temporal profile of blue light
activation such as in Eqs. 1, 2, 3, 4, and 5.

10. Note that while PDE solvers in other programming languages
may have similar requirements for initial conditions, boundary
conditions, source, and flux terms, they may require different
organizational formats for proper implementation.




11. For this example, we implemented a PDE model of iLID
diffusion where membrane diffusion coefficients for iLID-CAAX
(short anchor) and Stargazin-iLID (long multipass anchor) were
estimated to be 1 µm²/s and 0.1 µm²/s, respectively, based on
observations from previous studies [23, 24].


12. Custom MATLAB code for implementing both ODE and
PDE models designed for iLID recruitment can be found at
https://github.com/srcollins/Code_for_iLID_Recruitment_from_Springer-Protocol-Chapter-2022.
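The piecewise integration of Note 3 and the reflecting boundary of Note 8 can be illustrated with a short Python sketch. This is not the authors' MATLAB code from Note 12: it replaces the five-species model with a hypothetical two-state receptor (inactive and active) on a one-dimensional membrane strip, and it uses a hand-written explicit finite-difference scheme instead of pdepe. The activation rate, spot size, grid spacing, and time step are assumed values chosen for illustration; the reversal rate and the two diffusion coefficients follow Notes 4 and 11.

```python
def euler_step(inactive, active, light, d_coeff, k_act, k_rev, dx, dt):
    """One explicit Euler step with zero-flux (reflecting) walls."""
    n = len(inactive)

    def diffusion(u, i):
        # Net diffusive flux into cell i; faces that lie on a wall carry no flux.
        right = u[i + 1] - u[i] if i < n - 1 else 0.0
        left = u[i] - u[i - 1] if i > 0 else 0.0
        return d_coeff * (right - left) / dx ** 2

    new_inactive, new_active = [], []
    for i in range(n):
        # Light-driven activation minus dark reversion at position i.
        switch = k_act * light[i] * inactive[i] - k_rev * active[i]
        new_inactive.append(inactive[i] + dt * (diffusion(inactive, i) - switch))
        new_active.append(active[i] + dt * (diffusion(active, i) + switch))
    return new_inactive, new_active


def integrate(inactive, active, light, duration, dt, **rates):
    """Integrate one interval during which the light profile is constant."""
    center_trace = []
    for _ in range(round(duration / dt)):
        inactive, active = euler_step(inactive, active, light, dt=dt, **rates)
        center_trace.append(active[len(active) // 2])
    return inactive, active, center_trace


def simulate(d_coeff, n=81, dx=0.5, dt=0.01, t_on=3.0, t_off=7.0):
    """Light on in a central spot for t_on seconds, then dark for t_off seconds."""
    rates = dict(d_coeff=d_coeff, k_act=0.5, k_rev=0.02, dx=dx)  # k_act is assumed
    spot = [1.0 if abs(i - n // 2) <= 4 else 0.0 for i in range(n)]
    dark = [0.0] * n
    inactive, active = [0.1] * n, [0.0] * n  # uM, receptor starts fully inactive

    # Round 1: light on. Round 2: light off, started from the end state of round 1.
    inactive, active, trace_on = integrate(inactive, active, spot, t_on, dt, **rates)
    inactive, active, trace_off = integrate(inactive, active, dark, t_off, dt, **rates)

    trace = trace_on + trace_off
    peak = max(trace)
    time_to_peak = (trace.index(peak) + 1) * dt
    total = (sum(inactive) + sum(active)) * dx  # conserved by the reflecting walls
    return peak, time_to_peak, total


peak_slow, t_slow, total_slow = simulate(d_coeff=0.1)  # um^2/s, long anchor
peak_fast, t_fast, total_fast = simulate(d_coeff=1.0)  # um^2/s, short anchor
print(peak_slow, peak_fast, t_slow, total_slow)
```

In this toy setting the slower-diffusing receptor reaches a higher local peak of the active state, the peak coincides with the end of the light-on round, and the total amount of receptor stays constant because no flux crosses the walls.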


Acknowledgement


We would like to thank the following funding sources for their
support of this work. SRC thanks the NIH for the Director's New
Innovator Award DP2 HD094656. DEN thanks the NIH for the
Fellowship F31 HL152621, DS thanks the NIH T32 GM099608
training program, while KABN thanks the NIH T32 CA108459
training program for support.


References

1. Grecco HE, Schmick M, Bastiaens PIH (2011) Signaling from the living plasma membrane. Cell 144:897-909
2. Ho SN, Biggar SR, Spencer DM et al (1996) Dimeric ligands define a role for transcriptional activation domains in reinitiation. Nature 382:822-826
3. Spencer DM, Wandless TJ, Schreiber SL et al (1993) Controlling signal transduction with synthetic ligands. Science 262:1019-1024
4. Komatsu T, Kukelyansky I, McCaffery JM et al (2010) Organelle-specific, rapid induction of molecular activities and membrane tethering. Nat Methods 7:206-208
5. Miyamoto T, DeRose R, Suarez A et al (2012) Rapid and orthogonal logic gating with a gibberellin-induced dimerization system. Nat Chem Biol 8:465-470
6. Guntas G, Hallett RA, Zimmerman SP et al (2015) Engineering an improved light-induced dimer (iLID) for controlling the localization and activity of signaling proteins. Proc Natl Acad Sci U S A 112:112-117
7. Natwick DE, Collins SR (2021) Optimized iLID membrane anchors for local optogenetic protein recruitment. ACS Synth Biol 10:1009-1023
8. Zimmerman SP, Hallett RA, Bourke AM et al (2016) Tuning the binding affinities and reversion kinetics of a light inducible dimer allows control of transmembrane protein localization. Biochemistry 55:5264-5271
9. O'Neill PR, Castillo-Badillo JA, Meshik X et al (2018) Membrane flow drives an adhesion-independent amoeboid cell migration mode. Dev Cell 46:9-22.e4
10. O'Neill PR, Kalyanaraman V, Gautam N (2016) Subcellular optogenetic activation of Cdc42 controls local and distal signaling to drive immune cell migration. Mol Biol Cell 27:1442-1450
11. Okumura M, Natsume T, Kanemaki MT et al (2018) Dynein-dynactin-NuMA clusters generate cortical spindle-pulling forces as a multi-arm ensemble. Elife 7:e36559
12. Suh B-C, Inoue T, Meyer T et al (2006) Rapid chemically induced changes of PtdIns(4,5)P2 gate KCNQ ion channels. Science 314:1454-1457
13. Inoue T, Meyer T (2008) Synthetic activation of endogenous PI3K and Rac identifies an AND-gate switch for cell polarization and migration. PLoS One 3:e3068
14. Graziano BR, Gong D, Anderson KE et al (2017) A module for Rac temporal signal integration revealed with optogenetics. J Cell Biol 216:2515-2531
15. Robinson MS, Sahlender DA, Foster SD (2010) Rapid inactivation of proteins by rapamycin-induced rerouting to mitochondria. Dev Cell 18:324-331
16. Huang CY, Ferrell JE (1996) Ultrasensitivity in the mitogen-activated protein kinase cascade. Proc Natl Acad Sci U S A 93:10078-10083
17. Tian T, Harding A, Inder K et al (2007) Plasma membrane nanoswitches generate high-fidelity Ras signal transduction. Nat Cell Biol 9:905-914
18. Goryachev AB, Pokhilko AV (2006) Computational model explains high activity and rapid cycling of Rho GTPases within protein complexes. PLoS Comput Biol 2:e172
19. Jilkine A, Marée AFM, Edelstein-Keshet L (2007) Mathematical model for spatial segregation of the Rho-family GTPases based on inhibitory crosstalk. Bull Math Biol 69:1943-1978
20. Mori Y, Jilkine A, Edelstein-Keshet L (2008) Wave-pinning and cell polarity from a bistable reaction-diffusion system. Biophys J 94:3684-3697
21. Benedetti L, Barentine AES, Messa M et al (2018) Light-activated protein interaction with high spatial subcellular confinement. Proc Natl Acad Sci U S A 115:E2238-E2245
22. Kawano F, Suzuki H, Furuya A et al (2015) Engineered pairs of distinct photoswitches for optogenetic control of cellular proteins. Nat Commun 6:6256
23. Yanagawa M, Hiroshima M, Togashi Y et al (2018) Single-molecule diffusion-based estimation of ligand effects on G protein-coupled receptors. Sci Signal 11(548):eaao1917
24. Hoze N, Nair D, Hosy E et al (2012) Heterogeneity of AMPA receptor trafficking and molecular interactions revealed by superresolution analysis of live cell imaging. Proc Natl Acad Sci U S A 109:17052-17057


Chapter 11


Genome-Scale Modeling and Systems Metabolic 
Engineering of Vibrio natriegens for the Production 
of 1,3-Propanediol 


Ye Zhang, Dehua Liu, and Zhen Chen 


Abstract 


The fastest-growing bacterium Vibrio natriegens is a highly promising next-generation workhorse for 
molecular biology and industrial biotechnology. In this work, we describe the workflows for developing
genome-scale metabolic models and genome-editing protocols for engineering Vibrio natriegens. A case
study for metabolic engineering of Vibrio natriegens for the production of 1,3-propanediol is also
presented.


Key words Vibrio natriegens, Systems metabolic engineering, 1,3-Propanediol 


1 Introduction 


Vibrio natriegens is a moderately halophilic, gram-negative, and non- 
pathogenic microorganism isolated from salt marshes [1]. It is the 
fastest-growing microorganism identified thus far, with a minimal gen- 
eration time between 7 and 10 min [2]. More importantly, Vibrio 
natriegens has an exceptionally high glucose uptake rate in minimal 
media under both aerobic (~3.90 ± 0.08 g/g/h) and anaerobic
conditions (~7.81 ± 0.08 g/g/h) [3]. The inherent properties of Vibrio
natriegens make it a promising platform for next-generation biotech- 
nology [4, 5]. Recently, applications of Vibrio natriegens in molecular 
biology for vector construction, protein expression, and cell-free syn- 
thesis have been widely explored and well established [1, 6-10]. Appli- 
cation of Vibrio natriegens for the production of chemicals, such as 
2,3-butanediol, 1,3-propanediol, L-alanine, and melanin, was also 
demonstrated, highlighting its potential for industrial biotechnology 
[3-5, 11]. The development of systems biology tools and genome
editing technologies has also significantly accelerated the exploration and
development of V. natriegens as a new industrial chassis [5, 12-14].


Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology, 
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_11, 
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023 


The construction and application of genome-scale metabolic
models (GEMs) of industrial chassis are important for systematically
investigating and predicting their metabolic characteristics and
physiological properties and are widely used in computational
biology, systems metabolic engineering, and synthetic biology
applications [15, 16]. To the best of our knowledge, the reconstruction
of a GEM of V. natriegens has not been reported to date.

The development of efficient genome-editing tools is essential
for constructing microbial cell factories. Vibrio species can actively
take up exogenous DNA from the environment and integrate it
into their genome via natural transformation [12, 17]. By combining
natural transformation with the expression of the competence
regulator TfoX and the FLP/FRT recombination system, we have
developed an efficient approach for multiplex genome editing of
V. natriegens [5].

Here, we will present the workflow for generating a high-quality
GEM of V. natriegens and a genome-editing protocol for
engineering V. natriegens. A case study of the systematic metabolic
engineering of V. natriegens for the production of 1,3-propanediol
is also demonstrated.


2 Materials


2.1 Data Sources and Software


The methodologies and protocols for generating the GEMs of
microorganisms have been published previously [18, 19]. Data
sources and software used for the reconstruction of GEMs are
listed in Table 1. We will present the workflow for generating a
high-quality GEM of V. natriegens based on AutoKEGGRec, which
is an efficient tool for the generation of draft models [20].
AutoKEGGRec is a user-friendly tool to create draft models based on the
MATLAB platform and KEGG database. It is compatible with the
COBRA toolbox and convenient for the conversion of model files
to SBML [20]. The reconstruction of the GEM of V. natriegens
consists of five stages.


2.2 Strain and Media Recipes


1. Vibrio natriegens ATCC14048.

2. IPTG.

3. Rhamnose.

4. LBv2 medium: LB broth supplemented with v2 salts (200 mM
NaCl, 23.14 mM MgCl2, and 4.2 mM KCl).

5. BHIv2 medium: 37 g/L BHI and v2 salts.

6. Electroporation buffer: 232.8 g/L sucrose and 1.22 g/L
K2HPO4 (pH 7.0).

7. Ocean salt medium: ocean salt 28 g/L.

8. Antibiotics: kanamycin (100 µg/mL), spectinomycin (100 µg/mL),
ampicillin (100 µg/mL), chloramphenicol (5 µg/mL).


Table 1
Data sources used for the reconstruction of GEMs

Name                                        Link                                                              References

Genome and genomic annotation databases
Annotation of MIcrobial Genes               http://www.genoscope.cns.fr/agc/tools/amigene/index.html          [34]
BAR                                         https://bar.biocomp.unibo.it/bar3/                                [35]
Bioconductor                                https://bioconductor.org/                                         [36]
Genomes OnLine Database                     https://gold.jgi.doe.gov/                                         [37]
KBase                                       https://www.kbase.us/develop/                                     [21]
KEGG Automatic Annotation Server            https://www.genome.jp/kegg/kaas/                                  [38]
NCBI Entrez Gene                            http://www.ncbi.nlm.nih.gov/sites/entrez                          [39]
RAST                                        https://rast.nmpdr.org/                                           [40]

Biochemical databases
BRENDA                                      https://www.brenda-enzymes.info/                                  [41]
KEGG                                        https://www.genome.jp/kegg/                                       [42]
ModelSEED                                   https://modelseed.org/                                            [22]
pKa DB                                      http://www.acdlabs.com/products/phys_chem_lab/
PubChem                                     http://pubchem.ncbi.nlm.nih.gov/                                  [43]
Transporter Classification Database (TCDB)  http://www.tcdb.org/                                              [44]
TransportDB                                 http://www.membranetransport.org/transportDB2/index.html          [45]
UniProt                                     https://www.UniProt.org/                                          [46]

Protein localization databases
BASys                                       http://basys.ca/                                                  [47]
PSORT                                       https://www.psort.org/psortb/                                     [48]

Reconstruction resources and software
AutoKEGGRec                                 https://github.com/emikar/AutoKEGGRec                             [20]
COBRA                                       https://opencobra.github.io/cobratoolbox/stable/                  [28]
KBase                                       https://www.kbase.us/develop/                                     [21]
MATLAB                                      https://www.mathworks.com/products/MATLAB.html
Merlin                                      https://merlin-sysbio.org/                                        [23]
MetaCyc                                     https://metacyc.org/                                              [24]
ModelSEED                                   https://modelseed.org/                                            [22]
OptFlux                                     http://www.optflux.org/
RAVEN                                       https://github.com/SysBioChalmers/RAVEN                           [25]


9. Fermentation medium: KH2PO4 1.0 g/L, yeast extract 5 g/L,
(NH4)2SO4 5 g/L, NaCl 15 g/L, glycerol 50 g/L,
MgSO4·7H2O 1 g/L, CoCl2·6H2O 0.01 g/L, MnCl2·4H2O
0.01 g/L, FeSO4·7H2O 0.01 g/L, vitamin B12 0.005 g/L.

10. Fermentation feeding medium: 600 g/L glycerol and 10 g/L
yeast extract.


2.3 Plasmid and DNA Cassette


Plasmid pXMJ19-tfoX consists of an IPTG-inducible competence
regulator gene tfoX allowing Vibrio cells to become competent and
a rhamnose-inducible flp gene to remove selection markers with
FRT sites [5, 12].

The recombinant DNA cassettes used for homologous
recombination and gene editing are obtained by overlap extension PCR
containing the upstream fragment (~3000 bp) of the target gene,
the selection marker gene with two FRT loci, and the downstream
fragment (~3000 bp) of the target gene. The selection marker
genes could be resistance genes to kanamycin (Kan), spectinomycin
(Spec), ampicillin (Amp), etc.


3 Method


3.1 Genome-Scale Modeling


This section aims to provide a detailed guide for the construction
of the genome-scale metabolic model of V. natriegens (Fig. 1) [5].


3.1.1 Draft Reconstruction


The main task of this stage is to obtain genome annotation
information and biochemical information, including metabolite
candidate and metabolic reaction information, for the
reconstruction of a draft metabolic model. Since the reconstruction and
application of GEMs mainly rely on the biochemical data and
metabolic reactions of the draft, the quality and reliability of the
genome annotation information are critical to the quality of the
reconstruction. Therefore, it is important to acquire the latest and
most credible genome annotations. This stage could be accomplished
automatically with various software tools or web services,
including MetaDraft, MetaCyc, RAVEN, Merlin, ModelSEED,
KBase, AutoKEGGRec, etc. [20-27].

AutoKEGGRec is an algorithm designed to interact with the
COBRA toolbox based on the MATLAB platform. After proper
configuration, it can directly obtain all metabolites, reactions, genes,
annotations, and gene-protein-reaction rules based on the KEGG
database for the target microorganism by a simple command:


outputStruct = AutoKEGGRec(KEGG organism IDs)


For V. natriegens, the corresponding KEGG organism code is vna.
Thus, the following command can be used to get the draft model:


outputStruct = AutoKEGGRec('vna')


Stage 1 Draft reconstruction
1. Obtain genome information and genome annotation.
2. Obtain candidate metabolite and metabolic reaction.
3. Assemble draft reconstruction.

Stage 2 Refinement of reconstruction
4. Determine and verify substrate and cofactor usage.
5. Determine the charged formula.
6. Calculate reaction stoichiometry.
7. Determine reaction directionality.
8. Add information for gene and reaction localization.
9. Add subsystems information.
10. Verify gene-protein-reaction association.
11. Add metabolite identifier and related notes.
12. Repeat steps 4 to 11 for all genes.
13. Add spontaneous reactions to the reconstruction.
14. Add extracellular and periplasmic transport reactions.
15. Add intracellular transport reactions.
16. Add exchange reactions.
17. Determine and add biomass reaction.
18. Add ATP-maintenance reaction.
19. Add sink reactions.
20. Determine growth medium requirements.

Stage 3 Reconstruction of mathematical model
21. Configure the COBRA toolbox.
22. Load reconstruction into MATLAB.
23. Set objective function and suitable simulation constraints.

Stage 4 Network evaluation
24. Test if network is mass- and charge balanced.
25. Identify metabolic dead-ends.
26. Identify and fix gaps.
27. Adjust the simulation constraints for specific conditions.
28. Test if biomass precursors can be produced in specific medium.
29. Compare predicted physiological properties with known properties.

Stage 5 Data assembly and dissemination
30. Print MATLAB model content.
31. Add gap information to the reconstruction output.
32. Simulation and analysis.


Fig. 1 Brief overview of iterative reconstruction of a genome-scale metabolic model. The general procedure is
referenced from [19]
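Steps 24 and 25 of Fig. 1 (mass balance and dead-end detection) are normally run with COBRA Toolbox functions in MATLAB. The Python sketch below only illustrates the underlying bookkeeping on a hypothetical toy network for glycerol conversion to 1,3-propanediol. The metabolite identifiers, reaction names, and the lumping of reducing equivalents into H2 are assumptions made for illustration, all reactions are treated as irreversible in the written direction, and charge balance is not checked.

```python
import re
from collections import Counter

# Hypothetical metabolite formulas (neutral forms, no charges).
FORMULAS = {
    "glyc": "C3H8O3",   # glycerol
    "hpa3": "C3H6O2",   # 3-hydroxypropionaldehyde
    "pdo13": "C3H8O2",  # 1,3-propanediol
    "h2o": "H2O",
    "h2": "H2",         # stand-in for reducing equivalents
}

# Reactions as {metabolite: coefficient}; negative = consumed, positive = produced.
REACTIONS = {
    "EX_glyc": {"glyc": 1},                          # uptake, unbalanced by design
    "DHAB": {"glyc": -1, "hpa3": 1, "h2o": 1},       # glycerol dehydratase
    "DHAT": {"hpa3": -1, "h2": -1, "pdo13": 1},      # 1,3-PDO oxidoreductase
}


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C3H8O3' (no brackets)."""
    atoms = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        atoms[element] += int(number) if number else 1
    return atoms


def is_mass_balanced(stoichiometry, formulas=FORMULAS):
    """True if every element sums to zero over the reaction."""
    balance = Counter()
    for metabolite, coefficient in stoichiometry.items():
        for element, count in parse_formula(formulas[metabolite]).items():
            balance[element] += coefficient * count
    return all(total == 0 for total in balance.values())


def dead_end_metabolites(reactions):
    """Metabolites that are only produced or only consumed in the network."""
    produced, consumed = set(), set()
    for stoichiometry in reactions.values():
        for metabolite, coefficient in stoichiometry.items():
            (produced if coefficient > 0 else consumed).add(metabolite)
    return produced ^ consumed


internal = {name: s for name, s in REACTIONS.items() if not name.startswith("EX_")}
unbalanced = [name for name, s in internal.items() if not is_mass_balanced(s)]
dead_ends = sorted(dead_end_metabolites(REACTIONS))
print(unbalanced, dead_ends)
```

In this toy network the dead ends point to the missing export and supply reactions (water, 1,3-propanediol, and the reducing-equivalent stand-in), which is the kind of gap that steps 14 to 16 and 26 of Fig. 1 are meant to close.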


3.1.2 Reconstruction Refinement


It should be noted that the draft obtained in the first stage might be 
incomplete and contain many errors, including uncertain cofactor 
preference, inaccurate reaction stoichiometry, and missing reac- 
tions and genes. These issues need to be carefully calibrated. In 
addition, the localization of enzymes should also be determined 
and subsequently contribute to the addition of transport reactions 
and exchange reactions. Metabolite identifiers, related references, 
and notes also need to be added to improve the readability and 
compatibility of the model. 

Moreover, the formula of the biomass reaction should be estimated
or determined, which plays an important role in in silico
simulation [19, 26]. The biomass reaction formula consists of all
known components and their fractional contributions to the overall
cellular biomass, including protein, RNA, DNA, lipids,
lipopolysaccharides, peptidoglycan, glycogen, polyamines, etc. The
growth-associated ATP maintenance (GAM) reaction and the
nongrowth-associated ATP maintenance (NGAM) reaction, which
account for the energy necessary for cell replication or maintaining
the cell, respectively, should also be determined by chemostat
growth experiments or estimated according to the available
literature [19].

Curation and refinement are important to reconstruct a
high-quality GEM. Detailed steps can be found in Fig. 1. This stage
could be accomplished by employing biochemical databases and
software, but manual evaluation is still indispensable. Databases
including NCBI, SEED, KEGG, BRENDA, TransportDB,
UniProt, etc. could be helpful. Since AutoKEGGRec is compatible
with the COBRA toolbox, any correction could be directly added
to the existing data [28].


3.1.3 Reconstruction of the Mathematical Model


In this stage, the refined biochemical information is converted into
a mathematical format. MATLAB supplemented with the SBML
toolbox, COBRA toolbox, and an LP solver could automate this
process [28-30]. Moreover, the system boundaries and simulation
constraints are defined in this stage, which convert the GEMs to
condition-specific models. Due to the increasing abundance of
biological and biochemical information, fine-tuned constraints
could be set to improve the accuracy and reliability of the model
compared to the actual metabolism.


3.1.4 Network Evaluation


This stage consists of model verification and evaluation. Although
reconstruction refinement is performed in stage 2, there could still
be some omissions or errors in the metabolic model and
mathematical model, including inappropriate constraints, missing
transport reactions or exchange reactions, dead-end metabolites,
network gaps, etc. It is important to test whether biomass and
biomass precursors can be produced in specific media in this
stage. This contributes to analyzing and investigating the difference
between the simulated result and actual metabolism and further
refining the GEMs. Iterative manual refinement is important and
necessary to reconstruct a high-quality GEM. MATLAB and
COBRA are helpful to identify and fix these problems.


3.1.5 Data Assembly and Dissemination


Once iterative and precise refinement is achieved, GEMs can be
employed for in silico analysis. By defining desired and appropriate
constraints, particular metabolic characteristics and metabolic flux
distribution could be obtained to investigate the properties of
microorganisms and to guide metabolic engineering.


3.2 Gene-Editing Protocol (Fig. 2)


3.2.1 Introduction of Plasmid pXMJ19-tfoX


The introduction of plasmid pXMJ19-tfoX into V. natriegens could 
be achieved by electrotransformation [1, 31]: 


1. Inoculate V. natriegens in 5 mL LBv2 medium overnight at 
37 °C and 200 rpm. 


Fig. 2 Genome-editing sketch map of V. natriegens. The general procedure is referenced from [5]


2. Inoculate the overnight culture in 100 mL BHIv2 medium at a 
dilution of 1:100, and grow it at 37 °C and 200 rpm until an 
OD600 of 0.5. 


3. Transfer the culture to precooled 50 mL tubes and incubate on 
ice for 20-30 min. 


4. Centrifuge the culture at 4 °C and 6500 rpm for 15 min. 
5. Decant the supernatant and gently resuspend the cell pellets 
with 5-10 mL electroporation buffer. 


6. Add 20-30 mL electroporation buffer and centrifuge the cells 
at 6750 rpm and 4 °C for 15 min. 


7. Repeat the wash two or three times. 


8. Gently resuspend the cell pellets with electroporation buffer to
obtain a final OD600 of 16.


9. Divide cells into chilled tubes. 
10. Add plasmids to the cells and gently mix (2:100 v/v). 


11. Transfer the mixture to a precooled 1 mm electroporation
cuvette and electroporate with the following parameters:
800 V, 25 µF, and 200 Ω.


12. After electroporation, add 500 µL BHIv2 medium immediately,
and culture the mixture at 37 °C and 200 rpm for 1-2 h
for recovery.


13. Plate out the culture on solid LBv2 plates containing chloram- 
phenicol, and incubate overnight at 37 °C for colony growth. 
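The electroporation settings in step 11 and the buffer recipe in Subheading 2.2 can be sanity-checked with simple arithmetic. The Python sketch below assumes that the buffer salt is K2HPO4, uses standard molar masses (342.30 g/mol for sucrose, 174.18 g/mol for K2HPO4), and treats the time constant as the nominal product of the set resistance and capacitance; the actual pulse decay also depends on the conductivity of the sample.

```python
# Pulse settings from step 11.
VOLTAGE_V = 800.0
GAP_CM = 0.1             # 1 mm cuvette
RESISTANCE_OHM = 200.0
CAPACITANCE_F = 25e-6    # 25 uF

field_kv_per_cm = VOLTAGE_V / GAP_CM / 1000.0
time_constant_ms = RESISTANCE_OHM * CAPACITANCE_F * 1000.0


def millimolar(grams_per_liter, molar_mass_g_per_mol):
    """Convert a mass concentration to a molar concentration in mM."""
    return grams_per_liter / molar_mass_g_per_mol * 1000.0


# Electroporation buffer from Subheading 2.2 (molar masses are assumed values).
sucrose_mm = millimolar(232.8, 342.30)
phosphate_mm = millimolar(1.22, 174.18)

print(f"field strength: {field_kv_per_cm:.1f} kV/cm")
print(f"nominal time constant: {time_constant_ms:.1f} ms")
print(f"sucrose: {sucrose_mm:.0f} mM, phosphate: {phosphate_mm:.1f} mM")
```

Under these assumptions the settings correspond to about 8 kV/cm with a nominal 5 ms time constant, and the buffer to roughly 680 mM sucrose and 7 mM phosphate.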


3.2.2 Natural Transformation


1. Incubate the strains harboring pXMJ19-tfoX overnight in
LBv2 medium with 5 µg/mL chloramphenicol and 1 mM
IPTG at 30 °C and 200 rpm [32].


2. Dilute the overnight culture 100-fold with ocean salt medium
containing 1 mM IPTG.


3. Add 200 ng recombinant DNA fragment.


4. Incubate the mixture statically at 30 °C for 4-6 h for natural
transformation.


5. Culture the cells at 30 °C and 200 rpm for recovery with the
supplement of 1 mL of LBv2 medium.


6. Plate out the recovery culture on solid LBv2 plates containing
appropriate antibiotics, and incubate overnight at 37 °C for
colony growth.


3.2.3 Elimination of the Selection Marker and Curing of Plasmid pXMJ19-tfoX


To eliminate the selection marker:


1. Culture the selected strain in 5 mL LBv2 medium with 1 mM
rhamnose at 37 °C and 200 rpm for 12 h.


2. Dilute the culture 100-fold with LBv2 medium, and spread it
on solid LBv2 plates with only chloramphenicol to
maintain pXMJ19-tfoX.


3. Screen the strains without selection markers by testing the
antibiotic resistance of the colony or by colony PCR.


To cure plasmid pXMJ19-tfoX:


1. Culture the strains in 5 mL LBv2 medium without antibiotics
at 37 °C and 200 rpm for 12 h.


2. Dilute the culture 10,000-fold with LBv2 medium, and spread
it on solid LBv2 plates without any antibiotics.


3. Screen the strains without plasmid by testing the antibiotic
resistance of the colony or by colony PCR.


3.3 Systems Metabolic Engineering of V. natriegens for the Production of 1,3-Propanediol


1,3-Propanediol (1,3-PDO) is a valuable chemical that is used as a 
solvent, an antifreeze, and a monomer for the synthesis of poly- 
ethers, polyurethanes, and polyesters. Importantly, 1,3-PDO can 
be used as a building block for the synthesis of a high-performance 
polyester, polytrimethylene terephthalate (PTT), which is widely 
used in carpets, automotive fabrics, furnishings, garments, and 
many other industries [5, 33]. 

According to the protocol described in part 2, a GEM of Vibrio
natriegens is generated with AutoKEGGRec [19, 20]. After iterative
refinement, the general composition of the model is shown in
Table 2. Then the synthetic pathway of 1,3-PDO from glycerol is
introduced into the refined GEM, and the model is imported into
OptFlux [5, 33]. By investigating and analyzing the perturbation of the
heterologous 1,3-PDO synthesis pathway to the metabolic flux
distribution of V. natriegens, we developed several systems metabolic
engineering strategies to enhance the production of 1,3-PDO:


Table 2
Composition of the refined GEM of V. natriegens

Category      Item                  Count
Gene          Gene                  1183
              Gene rules            1613
Metabolite    Metabolite            1299
              Compartments          3
Reaction      Reaction              1527
              Metabolism            1265
              Transport reaction    179
              Exchange reaction     83

1. Knockout of genes involved in byproduct formation.

2. Improvement of the intracellular reducing environment.

3. Balance of the 1,3-PDO synthesis module and glycerol-oxidative
pathway.

4. Optimization of the cultivation process.


According to these strategies, the following gene modifications
and process optimization are carried out:


1. Deletion of the adhE, ldhA, pta-ackA, pfl, and aldAB genes to
block metabolic fluxes to ethanol, lactate, acetate, formate, and
3-hydroxypropionic acid.

2. Deletion of the global transcriptional regulators ArcA and
GlpR to improve glycerol metabolism and increase the
intracellular reducing power.

3. Deletion of sthA gene and overexpression of putAB genes to
further improve the intracellular concentration of NADPH.

4. Pathway engineering by combinatorial optimization to balance
the 1,3-PDO synthesis module and glycerol-oxidative pathway
and reduce the accumulation of the toxic intermediate metabolite
3-hydroxypropionaldehyde.

5. Optimization of the fermentation process by adjusting the
dissolved oxygen.


The performance of the engineered strain is tested via fed-batch
fermentation in a 400 mL T&J minibox parallel bioreactor.

For fed-batch fermentation, the seed culture is grown for 5 h in
LBv2 medium at 37 °C and 200 rpm and then inoculated into the
parallel bioreactors (10% v/v). The fermentations are conducted at
37 °C, pH 6.5 (controlled with 5 M NaOH), and an aeration rate
of 2.0 vvm. The rotation speed is automatically adjusted to maintain
the dissolved oxygen at 10% of saturation. A feeding medium is
added to keep the glycerol concentration above 10 g/L. The final
engineered strain can efficiently produce about 70 g/L 1,3-PDO
from glycerol with a yield of 0.61 mol/mol and a productivity of
2.36 g/L/h in fed-batch fermentation.
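
As a quick consistency check on the reported figures, the molar yield
can be converted to a mass yield, and the implied process time can be
estimated from titer and productivity. The sketch below is
illustrative only: the molar masses are standard values for glycerol
and 1,3-propanediol, the function name is hypothetical, and the
glycerol estimate ignores volume changes caused by feeding, which is
an assumption not stated in the protocol.

```python
# Standard molar masses (g/mol)
MW_PDO = 76.09        # 1,3-propanediol, C3H8O2
MW_GLYCEROL = 92.09   # glycerol, C3H8O3


def fed_batch_summary(titer_g_per_l, molar_yield, productivity_g_per_l_h):
    """Derive secondary metrics from titer, molar yield, and productivity.

    Hypothetical helper for illustration; assumes constant culture volume.
    """
    # mol product / mol substrate -> g product / g substrate
    mass_yield = molar_yield * MW_PDO / MW_GLYCEROL
    # Average volumetric productivity = titer / process time
    process_time_h = titer_g_per_l / productivity_g_per_l_h
    # Glycerol converted per liter of broth at the reported yield
    glycerol_consumed = titer_g_per_l / mass_yield
    return {
        "mass_yield_g_per_g": mass_yield,
        "process_time_h": process_time_h,
        "glycerol_consumed_g_per_l": glycerol_consumed,
    }


summary = fed_batch_summary(70.0, 0.61, 2.36)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With the reported values, the mass yield comes out at roughly 0.5 g/g
and the implied fermentation time at just under 30 h.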


4 Notes

1. Since biological information is constantly updated and revised,
   GEMs should be updated accordingly.

2. Wrong reaction directionality can result in abnormal metabolic
   flux distributions and futile cycles. It should be corrected
   based on published literature and thermodynamic data.

3. Experiments or thermodynamic data can be helpful to set
   appropriate boundary conditions for simulations.

4. V. natriegens grows very quickly. The experiment period should
   be arranged wisely.

5. The cell concentration has a great influence on the natural
   transformation efficiency. ~10° CFUs in 350 μL ocean salt
   medium is recommended.

6. V. natriegens has a strong tolerance to kanamycin. False-positive
   clones should be discriminated during screening.

7. The time of rhamnose induction is important, and 12 h is
   recommended.

8. Static incubation is required during natural transformation.


Acknowledgments

This work was supported by the National Natural Science Foundation
of China (Grant Nos. 21878172, 21938004, and 22078172), the
National Key R&D Program of China (No. 2018YFA0901500), and the
DongGuan Innovative Research Team Program (No. 201536000100033).


References

1. Weinstock MT, Hesek ED, Wilson CM et al (2016) Vibrio natriegens
   as a fast-growing host for molecular biology. Nat Methods
   13(10):849-851
2. Maida I, Bosi E, Perrin E et al (2013) Draft genome sequence of
   the fast-growing bacterium Vibrio natriegens strain DSMZ 759.
   Genome Announc 1(4):e00648-e00613
3. Hoffart E, Grenz S, Lange J et al (2017) High substrate uptake
   rates empower Vibrio natriegens as production host for industrial
   biotechnology. Appl Environ Microbiol 83(22):e01614-e01617
4. Erian AM, Freitag P, Gibisch M et al (2020) High rate
   2,3-butanediol production with Vibrio natriegens. Bioresour
   Technol Rep 10:100408
5. Zhang Y, Li Z, Liu Y et al (2021) Systems metabolic engineering
   of Vibrio natriegens for the production of 1,3-propanediol.
   Metab Eng 65:52-65
6. Becker W, Wimberger F, Zangger K (2019) Vibrio natriegens: an
   alternative expression system for the high-yield production of
   isotopically labeled proteins. Biochemistry 58(25):2799-2803
7. Schleicher L, Muras V, Claussen B et al (2018) Vibrio natriegens
   as host for expression of multisubunit membrane protein
   complexes. Front Microbiol 2018:2537
8. Xu J, Dong F, Wu M et al (2021) Vibrio natriegens as a
   pET-compatible expression host complementary to Escherichia
   coli. Front Microbiol 12:147
9. Des Soye BJ, Davidson SR, Weinstock MT et al (2018) Establishing
   a high-yielding cell-free protein synthesis platform derived
   from Vibrio natriegens. ACS Synth Biol 7(9):2245-2255
10. Wiegand DJ, Lee HH, Ostrov N et al (2018) Establishing a
    cell-free Vibrio natriegens expression system. ACS Synth Biol
    7(10):2475-2479
11. Wang Z, Tschirhart T, Schultzhaus Z et al (2020) Melanin
    produced by the fast-growing marine bacterium Vibrio natriegens
    through heterologous biosynthesis: characterization and
    application. Appl Environ Microbiol 86(5):e02749-e02719
12. Dalia TN, Hayes CA, Stolyar S et al (2017) Multiplex genome
    editing by natural transformation (MuGENT) for synthetic
    biology in Vibrio natriegens. ACS Synth Biol 6(9):1650-1655
13. Lee HH, Ostrov N, Wong BG et al (2019) Functional genomics of
    the rapidly replicating bacterium Vibrio natriegens by CRISPRi.
    Nat Microbiol 4(7):1105-1113
14. Tschirhart T, Shukla V, Kelly EE et al (2019) Synthetic biology
    tools for the fast-growing marine bacterium Vibrio natriegens.
    ACS Synth Biol 8(9):2069-2079
15. Kim B, Kim WJ, Kim DI et al (2015) Applications of genome-scale
    metabolic network model in metabolic engineering. J Ind
    Microbiol Biotechnol 42(3):339-348
16. Zhang C, Hua Q (2015) Applications of genome-scale metabolic
    models in biotechnology and systems medicine. Front Physiol
    6:413
17. Pollack-Berti A, Wollenberg MS, Ruby EG (2010) Natural
    transformation of Vibrio fischeri requires tfoX and tfoY.
    Environ Microbiol 12(8):2302-2311
18. Orth JD, Conrad TM, Na J et al (2011) A comprehensive
    genome-scale reconstruction of Escherichia coli metabolism.
    Mol Syst Biol 7:535
19. Thiele I, Palsson BO (2010) A protocol for generating a
    high-quality genome-scale metabolic reconstruction. Nat Protoc
    5(1):93-121
20. Karlsen E, Schulz C, Almaas E (2018) Automated generation of
    genome-scale metabolic draft reconstructions based on KEGG.
    BMC Bioinformatics 19(1):467
21. Allen B, Drake M, Harris N et al (2017) Using KBase to assemble
    and annotate prokaryotic genomes. Curr Protoc Microbiol
    46(1):1E.13.1-1E.13.18
22. Devoid S, Overbeek R, DeJongh M et al (2013) Automated genome
    annotation and metabolic model reconstruction in the SEED and
    Model SEED. Methods Mol Biol 985(985):17-45
23. Abecasis GR, Cherny SS, Cookson WO et al (2002) Merlin: rapid
    analysis of dense genetic maps using sparse gene flow trees.
    Nat Genet 30(1):97-101
24. Caspi R, Altman T, Dreher K et al (2012) The MetaCyc database
    of metabolic pathways and enzymes and the BioCyc collection of
    pathway/genome databases. Nucleic Acids Res 40(Database
    issue):D742-D753
25. Wang H, Marcisauskas S, Sanchez BJ et al (2018) RAVEN 2.0: a
    versatile toolbox for metabolic network reconstruction and a
    case study on Streptomyces coelicolor. PLoS Comput Biol
    14(10):e1006541
26. Mendoza SN, Olivier BG, Molenaar D et al (2019) A systematic
    assessment of current genome-scale metabolic reconstruction
    tools. Genome Biol 20(1):158
27. Aite M, Chevallier M, Frioux C et al (2018) Traceability,
    reproducibility and wiki-exploration for "a-la-carte"
    reconstructions of genome-scale metabolic models. PLoS Comput
    Biol 14(5):e1006146
28. Hyduke D, Hyduke D, Schellenberger J et al (2011) COBRA toolbox
    2.0. Protocol Exchange.
    https://doi.org/10.1038/protex.2011.234
29. Vilaca P, Noronha A, Rocha I et al (2014) Visualization Plugin
    for Optflux: tools for the creation of metabolic layouts and
    analysis of flux distributions. COBRA 2014 - 3rd Conference on
    Constraint-Based Reconstruction and Analysis. Charlottesville,
    VA, USA
30. Heirendt L, Arreckx S, Pfau T et al (2019) Creation and
    analysis of biochemical constraint-based models using the COBRA
    Toolbox v.3.0. Nat Protoc 14(3):639-702
31. Gonzales MF, Brooks T, Pukatzki SU et al (2013) Rapid protocol
    for preparation of electrocompetent Escherichia coli and Vibrio
    cholerae. J Vis Exp 80:e50684
32. Dalia AB, McDonough E, Camilli A (2014) Multiplex genome
    editing by natural transformation. Proc Natl Acad Sci U S A
    111(24):8937-8942
33. Zhang Y, Liu D, Chen Z (2017) Production of C2-C4 diols from
    renewable bioresources: new metabolic pathways and metabolic
    engineering strategies. Biotechnol Biofuels 10:299
34. Bocs S, Cruveiller S, Vallenet D et al (2003) AMIGene:
    annotation of MIcrobial genes. Nucleic Acids Res
    31(13):3723-3726
35. Profiti G, Martelli PL, Casadio R (2017) The Bologna annotation
    resource (BAR 3.0): improving protein functional annotation.
    Nucleic Acids Res 45(W1):W285-W290
36. Gentleman RC, Carey VJ, Bates DM et al (2004) Bioconductor:
    open software development for computational biology and
    bioinformatics. Genome Biol 5(10):R80
37. Liolios K, Chen IM, Mavromatis K et al (2010) The Genomes On
    Line Database (GOLD) in 2009: status of genomic and metagenomic
    projects and their associated metadata. Nucleic Acids Res
    38(Database issue):D346-D354
38. Moriya Y, Itoh M, Okuda S et al (2007) KAAS: an automatic
    genome annotation and pathway reconstruction server. Nucleic
    Acids Res 35(Web Server issue):W182-W185
39. Maglott D, Ostell J, Pruitt KD et al (2005) Entrez Gene:
    gene-centered information at NCBI. Nucleic Acids Res
    33(Database issue):D54-D58
40. Aziz RK, Bartels D, Best AA et al (2008) The RAST server: rapid
    annotations using subsystems technology. BMC Genomics 9(1):75
41. Scheer M, Grote A, Chang A et al (2011) BRENDA, the enzyme
    information system in 2011. Nucleic Acids Res 39(Database
    issue):D670-D676
42. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and
    genomes. Nucleic Acids Res 28(1):27-30
43. Wang Y, Xiao J, Suzek TO et al (2009) PubChem: a public
    information system for analyzing bioactivities of small
    molecules. Nucleic Acids Res 37(Web Server issue):W623-W633
44. Saier MH Jr, Tran CV, Barabote RD (2006) TCDB: the transporter
    classification database for membrane transport protein analyses
    and information. Nucleic Acids Res 34(Database
    issue):D181-D186
45. Elbourne LD, Tetu SG, Hassan KA et al (2017) TransportDB 2.0: a
    database for exploring membrane transporters in sequenced
    genomes from all domains of life. Nucleic Acids Res
    45(D1):D320-D324
46. Wu CH, Apweiler R, Bairoch A et al (2006) The Universal Protein
    Resource (UniProt): an expanding universe of protein
    information. Nucleic Acids Res 34(Database issue):D187-D191
47. Van Domselaar GH, Stothard P, Shrivastava S et al (2005) BASys:
    a web server for automated bacterial genome annotation. Nucleic
    Acids Res 33(Web Server issue):W455-W459
48. Nakai K, Horton P (1999) PSORT: a program for detecting sorting
    signals in proteins and predicting their subcellular
    localization. Trends Biochem Sci 24(1):34-35


Chapter 12


Application of GeneCloudOmics: Transcriptomic Data 
Analytics for Synthetic Biology 


Mohamed Helmy and Kumar Selvarajoo 


Abstract 


Research in synthetic biology and metabolic engineering requires a deep understanding of the function and
regulation of complex pathway genes. This can be achieved through gene expression profiling, which
quantifies transcriptome-wide expression under any condition, such as a cell development stage,
mutant, disease, or treatment with a drug. Expression profiling is usually done using high-throughput
techniques such as RNA sequencing (RNA-Seq) or microarrays. Although the two methods are based on
different technical approaches, both provide quantitative measures of the expression levels of thousands of
genes. The expression levels of the genes are compared under different conditions to identify the
differentially expressed genes (DEGs), the genes with different expression levels under different conditions. DEGs,
usually numbering in the thousands, are then investigated using bioinformatics and data analytic tools to
infer and compare their functional roles between conditions. Dealing with such large datasets, therefore,
requires intensive data processing and analyses to ensure data quality and produce results that are statistically
sound. Thus, there is a need for deep statistical and bioinformatics knowledge to deal with high-throughput
gene expression data. This represents a barrier for wet-lab biologists with limited computational, programming,
and data analytic skills that prevents them from realizing the full potential of the data. In this chapter, we
present a step-by-step protocol to perform transcriptome analysis using GeneCloudOmics, a cloud-based
web server that provides an end-to-end platform for high-throughput gene expression analysis.


Key words Synthetic biology, Transcriptomic data analysis, RNA-Seq, Bioinformatics, Biostatistics 


1. Introduction 


The recent rapid increase in the global human population raises the 
demands for food, drugs, and energy. Thus, novel and innovative 
approaches are required to fill the increasing gap between supply 
and demand. Metabolic engineering and synthetic biology are two
fields that hold promising potential for boosting the
food, drug, and fuel industries [1].

Metabolic engineering is the alteration of the metabolism of an 
organism to produce new compounds (protein, enzyme, or metab- 
olite) or to increase the yield of an existing one [2]. On the other 


Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology, 
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_12, 
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023 




hand, synthetic biology is the art of manipulating existing organ- 
isms to give them new abilities [3] or using cell-free systems as a 
bioengineering platform for manufacturing, diagnostics, and 
research applications [4]. These fields have made several contribu- 
tions to industry and research including creating plants that tolerate 
environmental changes [5] or bioremediation [6], developing bio- 
sensors for analyte detection [7, 8], bioproduction of several com- 
modities [9, 10], biofuel production, and several applications in 
regenerative medicine and immunology [4, 11, 12]. 

The process of manipulating and optimizing the genetics and
the growth conditions of an organism to increase the production of
certain substances or to add a new ability involves multiple steps and
can follow different scenarios [13]. These range from optimizing the
growth of an existing organism all the way to transferring a whole
gene cluster of a pathway from a selected organism to a model
organism and tuning the model organism's genetics and growth
to increase the yield of a substance or add a new ability [4, 14]. The
increase in the yield must be maximized so that production is
economically sound. Therefore, these processes aim to increase
the yield by thousands of fold [15]. To achieve this goal, metabolic
engineering and synthetic biology utilize modern biomedical
research techniques, including multi-omics approaches (genomics,
transcriptomics, proteomics, and metabolomics), genome-editing
techniques, systems biology, artificial intelligence, and
bioinformatics [16]. These techniques aim to define the "system" to
be manipulated and optimized, which is usually the biological
pathway(s) for producing the desired substance or ability. Among
these techniques, transcriptomics plays a crucial role in identifying
the pathways and the genes involved in them, thereby accelerating
research and applications within the field [17].

The outstanding advancements in the genome and transcrip- 
tome sequencing, genome editing, and genetic engineering in the 
last two decades, the significant decrease in the cost of molecular 
laboratory techniques, and the rapid development of new bioinfor- 
matics and computational biology tools and algorithms gave rise to 
different research fields in the biomedical space [18]. Synthetic 
biology is one of the major fields that benefited from these advance- 
ments. It aims to create new biological products (parts, devices, or 
systems) or redesign the existing biological systems to make them 
perform new functions such as producing new compounds that 
they do not produce in nature [19]. The applications of synthetic 
biology cover a wide range of industries including drug and vaccine 
development, research reagent production, biosensing, biofuels, 
and biomaterials [20, 21]. As a result, recent years have seen
a noticeable rise in synthetic biology start-ups, as well as adoption
by large companies in the pharma, biotechnology, and chemical
industries, resulting in a multibillion business that is based
on synthetic biology [22].

Transcriptomics is a powerful tool for studying biological systems
and elucidating gene functions [23]. In synthetic biology,
transcriptomics plays a crucial role in guiding the design processes
and the development of new devices or systems [17]. Since
transcriptomics provides a snapshot of the global gene expression
profile of the cell, analyzing it reveals the genes and pathways
involved in the investigated process. By comparing gene expression
profiles of different conditions, treatments, or time points,
differential expression analysis (DEA) identifies key genes and
pathways that can be used to modify biological processes, to add a
new product, or to increase the yield of an existing one [24].
Synthetic biology utilizes transcriptome analysis to understand the
mechanistic bases of gene functions and their regulation models,
which allows altering gene regulation or designing a synthetic
promoter [17]. It is also used to improve medicinal plants [25],
enhance our understanding of plant signaling pathways by combining
transcriptomics and biosensors [26], find new model organisms and
chassis for microalgae synthetic biology [27], and produce biofuel
and chemicals from bacteria [23].

The high-throughput transcriptome analysis platforms, such as
RNA-Seq and microarray, measure the level of expression of
thousands of genes in multiple conditions, at different
developmental stages, or under different treatment conditions [28].
The analysis of these data requires processing the raw gene
expression data to get the expression levels (e.g., read counts),
performing filtering and quality control (QC) steps to remove noise
and low-quality data, preprocessing and normalizing the expression
levels, statistically analyzing the data, identifying the
differentially expressed genes (DEGs) between different conditions,
and performing a functional analysis to elucidate the pathways and
cellular functions of the DEGs [29] (Fig. 1). Such analysis involves
multiple challenges related to data size, data quality, statistical
analysis, visualization, and interpretation of the results using
bioinformatics tools [30, 31].


2 Materials


In this chapter, we will use data from engineered Arabidopsis cells.
The data are from a study that aimed to investigate the activities of
the leucine-rich repeat receptor kinases (LRR-RKs) independently of
the endogenous receptors. LRR-RKs are a large group of receptor
kinases with over 200 members in Arabidopsis. The authors developed
a novel synthetic biology tool for investigating LRR-RK signaling
kinases in plants by developing rapamycin-inducible dimerization
(RiD) receptors that operate under the control of rapamycin (Rap),
which avoids interference with endogenous receptors [32]. The data
consist of five different conditions where the wild-type and the
engineered cells (RiD-BRI1/BAK1) were untreated (M) or treated
with rapamycin (Rap) or brassinolide (BL), with two replicates per
condition (GEO accession: GSE136177).


[Figure panels: Data generation (microarray scanner or RNA sequencer);
Processing & QC; Normalization; Statistical Analysis; DE Analysis;
Bioinformatics Analysis]

Fig. 1 An overview of the gene expression profiling workflow. The transcriptomic data (RNA-Seq or microarray)
generated by the experimental instrument are processed through the quality control (QC) steps. Next, the data
that passed the QC step are normalized. Multiple statistical tests can be performed on the normalized data. The
normalized data and different methods are used to infer the differential gene expressions (DGEs).
Bioinformatics analyses on the differentially expressed genes (DEGs) provide functional inference and pathway
association


3 Methods


3.1 Overview of GeneCloudOmics Server

Several tools are available for analyzing transcriptomic data in the
form of R packages, Python libraries, or software tools that use
existing libraries through a GUI (reviewed by [28]). Nevertheless,
the analysis of gene expression data remains a burden because of its
intensive statistical and programming skill requirements, which many
biologists who use online biological resources lack [33].
Furthermore, most of the available tools focus on data preprocessing
and DEG identification, with less focus on the statistical analysis
and even less attention to the downstream functional interpretation
[28].

With the challenges that biologists face while analyzing
transcriptomic data in mind, we developed GeneCloudOmics to
provide a one-stop server that performs the whole analysis [ref].
Online biological resources are the easiest resources to use
since they are equipped with GUIs that allow users to perform
the analysis with minimal computational skills and without local
installation or the need for programming. GeneCloudOmics was
developed as an online web server that performs end-to-end
transcriptomic data analysis, starting with preprocessing the raw
read count data, performing different statistical tests, identifying
the differentially expressed genes (DEGs), and finishing with the
downstream bioinformatics analysis of the DEG set.


3.1.1 Supports Different Transcriptomic Data Types

GeneCloudOmics supports RNA-Seq and microarray (.cel files)
data. Both types of data can be either uploaded to the server or
directly imported from the NCBI Gene Expression Omnibus (GEO)
database by providing the GEO accession of the transcriptomic
dataset in the designated form.


3.1.2 Provides Multiple Preprocessing Methods

GeneCloudOmics provides four normalization techniques (RPKM,
FPKM, TPM, RUV) that are commonly used with read counts. The
normalized data can be plotted against the raw data in box plots and
violin plots, with an option to download the normalized data in CSV
format.


3.1.3 Performs Nine Biostatistical Tests

For both preprocessed and processed transcriptomic data,
GeneCloudOmics allows several statistical tests to be performed.
These include read normalization for the preprocessed data, scatter
plots, correlations (linear and nonlinear), PCA (2D and 3D), and
clustering (hierarchical, k-means, t-SNE, and SOM). The results of
all tests are plotted in publication-ready quality.
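
Two of the read-count normalizations named above, RPKM and TPM, can
be sketched in a few lines. This is a minimal illustration of the
standard definitions, not the GeneCloudOmics implementation: the
function names, gene names, counts, and gene lengths are all
hypothetical, and the total number of mapped reads is assumed to be
the sum of the counts in the table.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts.values())  # assumption: all mapped reads are in the table
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total_reads)
        for gene, count in counts.items()
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    reads_per_kb = {
        gene: count / (lengths_bp[gene] / 1000.0)
        for gene, count in counts.items()
    }
    scale = sum(reads_per_kb.values())
    return {gene: value / scale * 1e6 for gene, value in reads_per_kb.items()}


# Hypothetical raw read counts and gene lengths (in base pairs)
raw_counts = {"geneA": 500, "geneB": 1000, "geneC": 1500}
gene_lengths = {"geneA": 1000, "geneB": 4000, "geneC": 1500}

rpkm_values = rpkm(raw_counts, gene_lengths)
tpm_values = tpm(raw_counts, gene_lengths)
for gene in raw_counts:
    print(gene, round(rpkm_values[gene], 1), round(tpm_values[gene], 1))
```

TPM values always sum to one million within a sample, which makes
them easier to compare across samples than RPKM values.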


3.1.4 Identifying Differentially Expressed Genes (DEG)

GeneCloudOmics provides an implementation of three of the most
commonly used DEG methods: DESeq2 [34], NOISeq [35], and
EdgeR [36]. The three methods can be used through a single
interface. The user selects the method of choice to perform the
differential gene expression analysis; GeneCloudOmics then
provides the user with the parameters of the selected method. The
list of DEGs can be downloaded in CSV format, and the results of the
differential gene expression analysis can be plotted in volcano and
dispersion plots.


3.1.5 Interprets and Analyzes Gene and Protein Lists

The list of DEGs can be interpreted by GeneCloudOmics using
11 different bioinformatics tools in order to investigate their
functions, pathways, and disease relevance and to study their
physicochemical and evolutionary properties. The bioinformatics
interpretation features of GeneCloudOmics can also be used
independently of the transcriptomic data analysis workflow to
interpret any given list of genes or proteins. The bioinformatics
tools of GeneCloudOmics set it apart from the available tools, since
most differential gene expression analysis tools either do not
include bioinformatics features for gene set analysis or include
only a few basic analyses such as GO and pathway enrichment [28].
Moreover, GeneCloudOmics provides all common protein and gene
bioinformatics tools, including GO enrichment analysis, pathway
enrichment analysis, complex enrichment, protein-protein
interaction (PPI), protein function, protein subcellular
localization, protein domains, tissue expression, gene
co-expression, protein physicochemical properties, protein
evolutionary analysis, and protein pathological analysis.


3.1.6 Creates a Customized Analysis Report

GeneCloudOmics provides the user with the option of creating an
analysis report that gathers and summarizes the results and plots
that the user finds interesting. The user can click the "Add to
Report" option on the left-hand side in all GeneCloudOmics
tests. This adds the plot and the analysis title to the analysis
report. At the end of the session, the user can go to the "Analysis
report" page in the main menu, and GeneCloudOmics will generate a
report that contains the added plots. The report is generated in
HTML format and can be downloaded as a single PDF file.


3.2 Statistical Tests in Transcriptomic Data Analysis

There are several biostatistical tests and data analytics that are used
in the analysis of transcriptomic data. GeneCloudOmics provides
ten of the most used tests and analytics for analyzing the data (such
as PCA and Pearson correlation) and assessing the quality of the
data (such as noise and entropy analysis). Here, we briefly overview
each of them.


3.2.1 Scatter Plot


The scatter plot compares the level of expression of the genes in any
two conditions or two replicates. It displays the respective
expression of all genes in a 2D space. Before creating a scatter
plot, it is recommended to normalize the data for sequencing depth
(see the preprocessing stage below). Since gene expression
data is naturally skewed toward very high expression-level regions,
it is recommended to apply a log transformation to the data to
capture the whole data range. In GeneCloudOmics, users can
choose between natural log, log base 2, and log base 10 and can add
an optional linear regression line to the plot as well. Gene
expression data are densely distributed in the lowly expressed
region, making the dots usually indistinguishable in a regular
scatter plot. GeneCloudOmics overlays a 2D kernel density estimation
on the scatter plot to visualize the density of expression levels.


3.2.2 Distribution Fitting

GeneCloudOmics provides several distribution fitting options that
compare the entire gene expression data to different continuous
statistical distributions, which can be used to test the data and
choose a nonarbitrary, statistically based lower expression cutoff.
To visualize the comparison, GeneCloudOmics displays the cumulative
distribution function of the preprocessed gene expression data with
the user-selected theoretical distributions. Once it is confirmed
that the gene set follows a particular distribution, it would be
safe to conclude the validity of the gene expression data.
GeneCloudOmics also provides a table that shows the best-fitted
distribution in each sample.


3.2.3 Correlation

Pearson Correlation

Pearson correlation measures the linear relationship between two
vectors. The Pearson correlation coefficient $r = 1$ if the two
vectors are identical and $r = 0$ if there are no linear
relationships between the vectors. The coefficient $r$ between the
two vectors (e.g., the transcriptome of two different samples),
containing $n$ observations (e.g., gene expression values), is
defined by (for large $n$):

$$r(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - \mu_X)(y_i - \mu_Y)}{\sigma_X \sigma_Y}$$

where $x_i$ and $y_i$ are the $i$th observations in the vectors $X$
and $Y$, respectively, $\mu_X$ and $\mu_Y$ the mean values of each
vector, and $\sigma_X$ and $\sigma_Y$ the corresponding standard
deviations.


Spearman Correlation

Spearman rank correlation is a nonparametric test that is used to
measure the degree of association between two vectors (e.g.,
transcriptome in two different samples). The Spearman rank
correlation test does not carry any assumptions about the
distribution of the data and is the appropriate correlation analysis
when the variables are measured on a scale that is at least ordinal.
The following formula is used to calculate the Spearman rank
correlation:

$$\rho(X, Y) = 1 - \frac{6 \sum_{i=1}^{n} \left(r_{x_i} - r_{y_i}\right)^2}{n(n^2 - 1)}$$

where $r_{x_i}$ and $r_{y_i}$ are the ranks of the $i$th gene, $x_i$
and $y_i$, in vectors $X$ and $Y$, respectively, and $n$ is the
number of genes in vector $(X, Y)$.
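
The two correlation measures can be sketched directly from their
formulas. This is an illustrative implementation only, not the code
used by GeneCloudOmics. It assumes population standard deviations
(dividing by n, consistent with the large-n form above) and, for
Spearman, that there are no tied values, since the rank-difference
formula is exact only in that case. The function names and example
vectors are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson r using population (1/n) mean, covariance, and SD."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    sd_x = sqrt(sum((a - mu_x) ** 2 for a in x) / n)
    sd_y = sqrt(sum((b - mu_y) ** 2 for b in y) / n)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    return cov / (sd_x * sd_y)


def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rho from squared rank differences (valid without ties)."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical expression vectors: y grows monotonically but not linearly with x
sample_x = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_y = [1.0, 4.0, 9.0, 16.0, 25.0]
print("Pearson:", pearson(sample_x, sample_y))
print("Spearman:", spearman(sample_x, sample_y))
```

For a monotonic but nonlinear relationship such as this one, the
Spearman coefficient is exactly 1 while the Pearson coefficient falls
slightly below 1, which illustrates why both measures are offered.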


228 Mohamed Helmy and Kumar Selvarajoo 


3.2.4 PCA 


3.2.5 Heatmap and Gene 
Clustering 


Principal component analysis (PCA) is a multivariate statistical 
technique for simplifying high-dimensional datasets [37]. Given 
m observations on 7 variables, the goal of PCA is to reduce the 
dimensionality of the data matrix by finding 7 new variables, where 
ris less than n. Termed principal components, these 7 new variables 
together account for as much of the variance in the original 
n variables as possible while remaining mutually uncorrelated and 
orthogonal. Each principal component is a linear combination of 
the original variables, and so it is often possible to ascribe meaning 
to what the components represent. A PCA analysis of transcrip- 
tomic data considers the genes as variables, creating a set of “prin- 
cipal gene components” that indicate the features of genes that best 
explain the experimental responses they produce. To compute the 
principal components, the eigenvalues and their corresponding 
eigenvectors are calculated from the ” x m covariance matrix of 
conditions. Each eigenvector defines a principal component. A 
component can be viewed as a weighted sum of the conditions, 
where the coefficients of the eigenvectors are the weights. The 
projection of gene 7 along the axis defined by the jth principal 
component is: 


n 
! P 

ai = AiVy 
t=1 


where »,;is the tth coefficient for the jth principal component, air is 
the expression measurement for gene z under the zth condition, and 
a is the data in terms of principal components. Since V is an 
orthonormal matrix, a’ is a rotation of the data from the original 
space of observations to a new space with principal component 
axes. The variance accounted for by each of the components is its 
associated eigenvalue; it is the variance of a component over all 
genes. Consequently, the eigenvectors with large eigenvalues are 
the ones that contain most of the information; eigenvectors with 
small eigenvalues are uninformative. 


Hierarchical clustering is used to find the groups of co-expressed 
genes [38]. The clustering is performed on normalized expressions 
of differentially expressed genes using Ward clustering method. 
Normalized expression of the jth gene at time ¢; is defined as: 


2; (ti) = (x;j(ti) — ¥) /0 
where x, ¢;) is the expression of the jth gene at time t,, 7 is the mean 
expression across all time points, and ?; is the standard deviation. 
GeneCloudOmics applies hierarchical clustering to the output
of the DE analysis performed with EdgeR [ref] in the previous section.
Alternatively, the user can carry out clustering independently, without
going through DE analysis, by specifying the minimum fold change
of gene expression between two samples. GeneCloudOmics also
lists the names of the genes in each cluster in the Gene Clusters tab.
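
The normalization step that precedes the clustering can be illustrated as follows. This is a minimal sketch: the gene names and values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is used.

```python
import statistics


def normalize_gene(profile):
    """Return (x - mean) / standard deviation for one gene's time course."""
    mean = statistics.mean(profile)
    sd = statistics.stdev(profile)  # sample standard deviation (assumption)
    if sd == 0:
        raise ValueError("constant profile: standard deviation is zero")
    return [(x - mean) / sd for x in profile]


# Hypothetical expression of two genes at five time points.
profiles = {
    "geneA": [10.0, 12.0, 15.0, 30.0, 33.0],
    "geneB": [1000.0, 1200.0, 1500.0, 3000.0, 3300.0],
}
normalized = {gene: normalize_gene(values) for gene, values in profiles.items()}
```

Although geneB is expressed 100 times more strongly than geneA, both genes end up with the same normalized profile, which is why the normalization is applied before co-expressed genes are grouped.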


3.2.6 Transcriptome-Wide Average Noise


3.2.7 Entropy


3.2.8 Random Forest Clustering


3.2.9 Self-Organizing Map (SOM)


To quantify the scatter in gene expression between all replicates
of one experimental condition, GeneCloudOmics computes the
transcriptome-wide average noise for each cell type, defined as:

$$\eta_{tot}^2 = \frac{1}{n} \sum_{i=1}^{n} \eta_i^2$$

where $n$ is the number of genes and $\eta_i^2$ is the pairwise noise of the
$i$th gene (variability between any two replicates), defined as:

$$\eta_i^2 = \frac{2}{m(m-1)} \sum_{j=1}^{m-1} \sum_{k=j+1}^{m} \eta_{ijk}^2$$

where $m$ is the number of replicates in each condition and $\eta_{ijk}^2$ is the
expression noise of the $i$th gene, defined by the variance divided by
the squared mean expression in the pair of replicates $(j, k)$.
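
A small sketch of the two noise definitions above is given below. The function names and the toy data are hypothetical. The sample variance is an assumption, since the text does not state which variance estimator is used for a pair of replicates.

```python
from itertools import combinations
from statistics import mean, variance


def pairwise_noise(replicates):
    """Pairwise noise of one gene, averaged over all replicate pairs (j, k)."""
    m = len(replicates)
    total = 0.0
    for x_j, x_k in combinations(replicates, 2):
        pair_mean = mean([x_j, x_k])
        if pair_mean == 0:
            raise ValueError("noise is undefined for a zero mean expression")
        # Variance divided by the squared mean of the pair (sample variance assumed).
        total += variance([x_j, x_k]) / pair_mean ** 2
    return 2.0 * total / (m * (m - 1))


def average_noise(expression):
    """Transcriptome-wide average noise: mean pairwise noise over all genes."""
    return mean([pairwise_noise(replicates) for replicates in expression])


# Hypothetical condition with two genes measured in two replicates each.
condition = [[1.0, 3.0], [5.0, 5.0]]
noise = average_noise(condition)
```

Genes whose replicates agree perfectly contribute zero noise, so the average reflects how much the replicates of a condition scatter across the whole transcriptome.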


Shannon entropy [39] measures the disorder of a high-dimensional
system, where higher values indicate increasing disorder. The
entropy of each transcriptome, $X$, is defined as:

$$H(X) = -\sum_{i} p(x_i) \log p(x_i)$$

where $p(x_i)$ is the probability of gene expression value $x = x_i$. The
entropy values are obtained through a histogram-based partitioning
approach, and the number of bins is determined using
Doane's rule:

$$b(X) = 1 + \log_2 n + \log_2\left(1 + \frac{|g_X|}{\sigma_g}\right)$$

where $n$ is the number of observations, $g_X$ is the skewness of the
expression distribution of each sample, and

$$\sigma_g = \sqrt{\frac{6(n-2)}{(n+1)(n+3)}}$$
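
The histogram-based entropy estimate can be sketched as below. Several details are assumptions made for illustration only: the logarithm base (2), rounding the bin count up, equal-width bins between the minimum and maximum value, and the function names. The sketch also assumes that the sample is not constant.

```python
import math


def skewness(values):
    """Moment-based skewness of a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def doane_bins(values):
    """Number of histogram bins according to Doane's rule."""
    n = len(values)
    sigma_g = math.sqrt(6.0 * (n - 2) / ((n + 1) * (n + 3)))
    bins = 1 + math.log2(n) + math.log2(1 + abs(skewness(values)) / sigma_g)
    return math.ceil(bins)  # rounding up is an assumption


def shannon_entropy(values):
    """Entropy (in bits) of the histogram of one sample's expression values."""
    bins = doane_bins(values)
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for x in values:
        index = min(int((x - low) / width), bins - 1)
        counts[index] += 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


# Hypothetical sample: 64 evenly spread expression values.
sample = [float(v) for v in range(1, 65)]
entropy = shannon_entropy(sample)
```

For this evenly spread sample the histogram is almost flat, so the entropy is close to its maximum, the base-2 logarithm of the number of bins.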


Random forest clustering is an unsupervised learning
approach in which the samples are clustered into different
classes based on their similarity (usually based on Euclidean
distance) [40]. The random forest algorithm is used to generate a
proximity matrix, a rough estimate of the distance between
samples based on the proportion of times the samples end up in the
same leaf node of the decision trees. The proximity matrix is
converted to a distance (dist) matrix, which is then input to the hierarchical
clustering algorithm. The implementation of the random forest
clustering in GeneCloudOmics is based on [41].
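
The proximity idea can be illustrated without training a forest. The sketch below assumes that the leaf reached by every sample in every tree is already known (the `leaf_ids` table is hypothetical), and it converts proximity to distance as one minus proximity, which is one common choice and not necessarily the conversion used in [41].

```python
def proximity_matrix(leaf_ids):
    """Fraction of trees in which two samples fall into the same leaf.

    leaf_ids[tree][sample] is the leaf reached by `sample` in `tree`.
    """
    n_trees, n_samples = len(leaf_ids), len(leaf_ids[0])
    prox = [[0.0] * n_samples for _ in range(n_samples)]
    for leaves in leaf_ids:
        for a in range(n_samples):
            for b in range(n_samples):
                if leaves[a] == leaves[b]:
                    prox[a][b] += 1.0 / n_trees
    return prox


def to_distance(prox):
    """Turn proximities into distances (1 - proximity is an assumption)."""
    return [[1.0 - p for p in row] for row in prox]


# Hypothetical leaf assignments of 4 samples in 4 trees.
leaf_ids = [
    [0, 0, 1, 1],
    [2, 2, 2, 3],
    [5, 4, 6, 6],
    [1, 1, 0, 0],
]
distance = to_distance(proximity_matrix(leaf_ids))
```

The resulting matrix is symmetric with zeros on the diagonal, so it can be passed to any hierarchical clustering routine that accepts a precomputed distance matrix.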


A self-organizing map (SOM) produces a two-dimensional, discre- 
tized representation of the high-dimensional gene expression 
matrix and is, therefore, a dimensionality reduction technique. 
Self-organizing maps use a neighborhood function to preserve 
the topological properties of the input gene expression matrix [42]. 

Each data point (one sample) in the input gene expression 
matrix recognizes itself by competing for representation. SOM 
mapping steps start from initializing the weight vectors. From 
there, a sample vector is selected randomly, and the map of weight 


3.2.10 t-Distributed Stochastic Neighbor Embedding (t-SNE)


3.3 Bioinformatics Analysis of the Differentially Expressed Genes


3.3.1 Gene Ontology (GO) Annotation


3.3.2 Pathway Enrichment Analysis


vectors is searched to find which weight best represents that sample. 
Each weight vector has neighboring weights that are close to it. The 
weight that is chosen is rewarded by being able to become more like 
that randomly selected sample vector. The neighbors of that weight 
are also rewarded by being able to become more like the chosen 
sample vector. This allows the map to grow and form different 
shapes. Most generally, they form square/rectangular/hexago- 
nal/L shapes in 2D feature space. 
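
One training step of the procedure described above can be sketched as follows. The grid, the sample, the learning rate, and the Gaussian neighborhood function are all assumptions chosen for illustration. A full SOM repeats this step many times while shrinking the learning rate and the neighborhood radius.

```python
import math


def best_matching_unit(weights, sample):
    """Return the (row, col) of the weight vector closest to the sample."""
    cells = [(r, c) for r in range(len(weights)) for c in range(len(weights[0]))]
    return min(cells, key=lambda rc: math.dist(weights[rc[0]][rc[1]], sample))


def update(weights, sample, learning_rate=0.5, radius=1.0):
    """Move the best matching unit and its grid neighbors toward the sample."""
    bmu = best_matching_unit(weights, sample)
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            grid_distance = math.dist((r, c), bmu)
            # Gaussian neighborhood function (an assumed, common choice).
            influence = math.exp(-grid_distance ** 2 / (2 * radius ** 2))
            step = learning_rate * influence
            row[c] = [w_k + step * (x_k - w_k) for w_k, x_k in zip(w, sample)]
    return bmu


# Hypothetical 2 x 2 map of two-dimensional weight vectors.
grid = [
    [[0.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [1.0, 1.0]],
]
sample = [0.9, 0.2]
winner = update(grid, sample)
```

After the update, the winning weight has moved halfway toward the sample, and its neighbors have moved by a smaller amount that depends on their distance to the winner on the grid.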


t-SNE is a dimensionality reduction approach that reduces the 
complexity of highly complex data such as transcriptomic data. It 
visualizes the sample interrelations in a two- or three-dimensional 
visualization. This allows the identification of the close similarities 
between samples through the relative location of mapped points. 
Since t-SNE is nonlinear and able to control the trade-off between 
local and global relationships among points, its visualization of the 
clusters is usually more compelling when compared with other 
methods [43]. GeneCloudOmics introduces an intuitive interface 
that allows performing t-SNE analysis on the processed untrans- 
formed transcriptomic data through entering three inputs: (1) per- 
plexity value, (2) the number of principal components (PC), and 
(3) the number of clusters. The user can also choose to log trans- 
form the data before submission. 


DGE analysis usually outputs a list of genes that are statistically 
determined as differentially expressed. Then, the list of DEGs is 
analyzed, interpreted, and annotated to learn more about the func- 
tions, pathways, and cellular processes that these genes are involved 
in. GeneCloudOmics provides 12 bioinformatics analyses that can 
be performed on a given gene/protein dataset. 


GeneCloudOmics performs GO annotation for a given set of pro- 
teins by reading the GO terms associated with them directly from 
UniProt Knowledgebase [44], then visualizes each of the three GO 
domains (cellular component, molecular function, and biological 
process) in an independent tab in a bar chart, as well as provides the 
annotation results in a downloadable tabular format. 


The pathway enrichment analysis of the DEGs produces a list of
biological pathways in which those genes are involved more often
than expected by chance. This provides researchers with
mechanistic insights into how those genes affect cellular functions
[45]. For a given gene or protein set, GeneCloudOmics uses
g:Profiler [46] to perform a pathway enrichment analysis and displays
the results as a network where the nodes are the pathways and the 
edges are the overlap between the pathways. GeneCloudOmics 
uses Cytoscape.JS for the network visualization [47]. The enrich- 
ment results can also be downloaded as a CSV file. 
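
What "more often than expected by chance" means can be made concrete with a generic over-representation test. The sketch below uses a one-sided hypergeometric test, which is a standard choice for this kind of analysis. It is not g:Profiler's code, the numbers are hypothetical, and it omits the multiple-testing correction that enrichment tools apply across pathways.

```python
from math import comb


def enrichment_p_value(population, pathway_size, sample_size, overlap):
    """Probability of seeing at least `overlap` pathway genes by chance.

    population   -- number of annotated genes in the background
    pathway_size -- number of background genes that belong to the pathway
    sample_size  -- number of genes in the submitted list (e.g., the DEGs)
    overlap      -- number of submitted genes that belong to the pathway
    """
    upper = min(pathway_size, sample_size)
    total = comb(population, sample_size)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, sample_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 5 of 50 DEGs fall in a 100-gene pathway
# drawn from a background of 20,000 genes.
p_value = enrichment_p_value(20000, 100, 50, 5)
```

A small p-value indicates that the overlap between the gene list and the pathway is unlikely to arise from a random gene list of the same size.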


3.3.3 Protein-Protein Interaction


3.3.4 Complex Enrichment


3.3.5 Protein Function


3.3.6 Protein Subcellular Localization


3.3.7 Protein Domains


3.3.8 Tissue Expression


3.3.9 Gene Co-expression


Investigating PPIs is one of the essential steps in systems biology 
studies. GeneCloudOmics provides the users with an interface 
where they can upload a set of proteins (UniProt accessions) and 
get all the interactions associated with them. The interactions are 
visualized as a network where the nodes are the proteins, and the 
edges are the interactions, and the node size corresponds to the 
number of interactors of the protein. We use Cytoscape.JS for PPI 
visualization [47]. The results are also displayed as an interaction 
table and can be downloaded as a network or an interaction table. 


GeneCloudOmics provides the user with a complex enrichment
feature that identifies the proteins in the provided
dataset that are part of a known protein complex. This feature uses
the CORUM database, which contains curated complex information
for mammalian proteins [48], and reports the complex-forming
proteins and the complex information for the submitted
dataset.


GeneCloudOmics retrieves the protein function information from 
UniProt of a given protein set (UniProt accessions). The retrieved 
protein functions are displayed in a downloadable tabular format. 


The protein subcellular localization feature of GeneCloudOmics 
provides the user with an interface to get the subcellular localiza- 
tion information for a given list of proteins (UniProt accessions) 
and display the results in a downloadable tabular format. 


GeneCloudOmics provides the users with a protein domain feature 
that connects to UniProt Knowledgebase and retrieves the domain 
information associated with each protein in a given list of UniProt 
accessions. 


The tissue expression feature in GeneCloudOmics provides the user 
with the tissue expression for each protein in a given protein list 
(UniProt accessions) through retrieving this information from 
UniProt Knowledgebase. The result is displayed in a downloadable 
tabular format. 


The co-expression analysis is a common analysis that assesses the 
expression level of different genes to identify simultaneously 
expressed genes. The resultant co-expression networks are used to 
identify functionally related genes or genes being controlled by the 
same transcriptional mechanism [49]. GeneCloudOmics provides 
the user with an interface where they can submit a co-expression
query to GeneMANIA [50] and then shows the results on GeneMANIA's
website in a new tab. Currently, GeneCloudOmics supports
queries for nine model organisms: human, yeast,
E. coli, C. elegans, Arabidopsis, Drosophila, zebrafish, mouse,
and rat.


3.3.10 Protein Physicochemical Properties


3.3.11 Protein Evolutionary Analysis


3.3.12 Protein Pathological Analysis


3.4 Transcriptomic Data Analysis Using GeneCloudOmics


3.4.1 The Required Data


3.4.2 Importing or Uploading Data to GeneCloudOmics


For a given set of proteins (UniProt accessions), GeneCloudOmics
provides the user with their complete sequences in a single
FASTA file and allows the user to investigate their physicochemical
properties. The physicochemical analysis includes sequence charge,
GRAVY index [51], and hydrophobicity.
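
As an illustration of one of these properties, the GRAVY (grand average of hydropathy) index is the mean hydropathy value of the residues in a sequence. The sketch below assumes the Kyte-Doolittle hydropathy scale and the 20 standard amino acids. The function name and the example peptides are hypothetical, and this is not the code used by GeneCloudOmics.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Mean hydropathy of a protein sequence (one-letter codes)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)


# Hypothetical peptides: one hydrophobic, one hydrophilic.
hydrophobic = gravy("AILV")
hydrophilic = gravy("RKDE")
```

Positive values indicate an overall hydrophobic sequence and negative values an overall hydrophilic one.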


For a given set of proteins (UniProt accessions), GeneCloudOmics 
provides the user with a phylogenetic and evolutionary analysis that 
includes multiple sequence alignment (MSA) of the protein 
sequences, clustering based on the amino acid sequences, chromo- 
somal location, or gene tree. 


Several diseases are associated with the malfunction of certain genes 
or proteins. The disease-protein association is collected in different 
online resources such as OMIM database [52], DisProt [53], and 
DisGeNET [54]. GeneCloudOmics provides the users with an 
interface that retrieves the disease-protein association from online 
databases for a given list of proteins (UniProt accessions). The 
disease-protein association is visualized as bubble charts that show 
the distribution of the proteins among the disease or the distribu- 
tion of diseases among the proteins. 


In synthetic biology research, transcriptome analysis is one of the 
main approaches that helps provide a deeper understanding of the 
investigated system and in developing new tools. In this section, we 
will demonstrate how GeneCloudOmics can be employed in tran- 
scriptomic data analysis for synthetic biology applications. 


GeneCloudOmics supports RNA-Seq and microarray data 
[28]. The RNA-Seq data can be in the form of raw read counts, which
go through multiple steps of preprocessing and normalization,
or normalized read counts, which are analyzed directly. The
microarray data is supported as CEL files. Multiple CEL files can
be uploaded to GeneCloudOmics as one compressed file. The data 
can also be imported directly from NCBI Gene Expression Omni- 
bus (GEO) database using GEO accession numbers. Figure 2 
shows the different data import and upload methods in 
GeneCloudOmics. 


We downloaded the RNA-Seq data of the engineered Arabidopsis 
cells from GEO, decompressed it, and created a metadata file 
(Table 1). Since the data is raw, we used the raw file (read count) 
upload option (Fig. 3a). This option requires two files: (1) the raw 
read count file and (2) the metadata file (Table 1). Optionally, two 
more files can be uploaded: (1) the gene length file, which is 
required for the RPKM, FPKM, and TPM normalization, and
(2) the negative controls (e.g., ERCC Spike-In) file, which is 
required for the RUV normalization. Several data upload and 
import alternatives are available as mentioned above (Fig. 3b-e). 


Fig. 2 An overview of GeneCloudOmics features. (a) RNA-Seq and microarray data uploading and importing.
(b) Data preprocessing and normalization using four different methods (upper quartile, FPKM, RPKM, and TPM).
(c) Transcriptomic data analysis that includes DGE analysis and multiple biostatistical tests (distributions
fitting, scatter plots, correlations, PCA), noise analysis, and clustering methods (hierarchical, k-means, t-SNE,

3.4.3 Data Preprocessing


3.4.4 Biostatistical Analysis of Normalized Data


Scatter Plot


Table 1
The contents of the metadata file


Sample  Time
CM1     t1
CM2     t1
CR1     t2
CR2     t2
RM1     t3
RM2     t3
RR1     t4
RR2     t4
RB1     t5
RB2     t5

Once the data upload is complete, we can preprocess and
normalize the raw data and plot the
raw data against the normalized data. First, go to the Preprocessing tab,
and enter the minimum value and the minimum
number of columns used for filtering; here, we are using the default values of
1 and 2, respectively. Next, choose the normalization method and
click submit. Each normalization method requires one of the
optional files, as explained above (Fig. 4a). GeneCloudOmics
plots the normalized data against the raw data as box and violin
plots (Fig. 4b and c). To view the plots, go to the
RLE plot tab and the violin plot tab, respectively.
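
To show why the gene length file is needed, the two length-dependent normalizations can be sketched for a single sample as follows. The counts and gene lengths are hypothetical, and taking the sum of the counts as the library size is an assumption. GeneCloudOmics performs these steps internally, so this is only an illustration of the arithmetic.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts)  # library size of the sample (assumption)
    return [
        count * 1e9 / (length * total_reads)
        for count, length in zip(counts, lengths_bp)
    ]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale."""
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [rate / total_rate * 1e6 for rate in rates]


# Hypothetical raw counts and gene lengths (in base pairs) for one sample.
raw_counts = [100, 200, 700]
gene_lengths = [1000, 2000, 7000]
sample_rpkm = rpkm(raw_counts, gene_lengths)
sample_tpm = tpm(raw_counts, gene_lengths)
```

In this toy sample the longer genes collect proportionally more reads, so all three genes receive the same normalized value. TPM values always sum to one million within a sample, which makes them easier to compare across samples than RPKM.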


GeneCloudOmics allows performing several statistical analyses on 
the normalized read count data. In this stage, we will demonstrate 
each of them. 


The scatter plot compares the gene expression profiles of two different
conditions or of two replicates. It is performed on the
normalized data. To create a scatter plot in GeneCloudOmics, perform
data normalization as described above, and then go to the “Scattered”
link in the “Transcriptome Analysis” menu. Choose the two
conditions or replicates that you want to compare as the X-axis and
Y-axis, choose the log transformation method, and then click the
plot button (Fig. 5a). The scatter plot is displayed with the
R-value shown above it (Fig. 5b).


Fig. 2 (continued) SOM). (d) Bioinformatics analysis of a gene or protein list that includes gene ontology (GO) enrichment,
pathway enrichment, PPI, complex enrichment, gene-/protein-disease association, protein properties,
evolutionary analysis, and protein pathological analysis


Fig. 3 Data upload and import to GeneCloudOmics. (a) Uploading raw RNA-Seq data, (b) uploading normalized
RNA-Seq data, (c) uploading microarray data (.CEL files), (d) importing data from GEO databases using GEO
accession, and (e) selecting a dataset from the GEO-imported data


Fig. 4 Raw data normalization and plots. (a) Normalization parameters and methods, (b) box plot of raw data
and normalized data, and (c) violin plot of raw data and normalized data


Fig. 5 The scatter plot of gene expression in two conditions. (a) The scatter plot parameters, (b) an example of
a scatter plot between the gene expression in the wild-type (CM1) and the engineered cells (RR1)


Distribution Fitting


GeneCloudOmics fits six different continuous statistical
distributions to the gene expression data for distribution
comparison. To perform a distribution fitting, perform data
normalization as described above, and then go to the “Distribution”
link in the “Transcriptome Analysis” menu. Choose the
condition or replicate that you want to investigate, and then choose the
statistical distribution(s) from the list of available distributions. You
can zoom to a range of expressions to investigate its distributions
(Fig. 6a). After providing all the parameters, click “Plot.” Repeat
the steps with all conditions or replicates of interest (Fig. 6b, c).


Fig. 6 The statistical distribution fitting parameters and plots. (a) The six different statistical continuous
distributions and the zooming option, (b and c) two distribution fitting results of the gene expression of the
wild-type and the engineered cells


Principal Component 
Analysis (PCA) 


GeneCloudOmics performs PCA and plots the PCA
variance, PCA-2D, and PCA-3D. To perform PCA on the normalized
data, go to the “PCA” link in the “Transcriptome Analysis”
menu. Enter the gene sample size, choose the gene sample order
from the list of provided options, and click “Plot” (Fig. 7a). In this
example, we are using the default values. The PCA variance tab
shows a bar plot of the top ten principal components (PCs)
(Fig. 7b).

To plot the PCA-2D and PCA-3D, go to the corresponding
tabs and enter the parameters. You can choose which PCs appear on
the X-axis and the Y-axis, the gene sample size, the gene sample
order, the number of clusters, and whether to display the sample names
(Fig. 8a). In the PCA-3D plot, you need to choose a third PC
(Fig. 8b, c).


Fig. 7 The PCA parameters and PCA variance plot. (a) The PCA parameters, (b) the PCA variance plot


Correlation

GeneCloudOmics provides two correlation tests: the Pearson
correlation test, which measures the linear relationship between the
gene expression vectors of different conditions or replicates, and
the Spearman rank correlation, which measures the degree of monotonic
association between the gene expression vectors of different
conditions or replicates.
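
The difference between the two coefficients can be seen in a small, dependency-free sketch. The function names and the two expression vectors are hypothetical. Spearman correlation is computed here as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical expression of five genes in two conditions.
condition_a = [1.0, 2.0, 3.0, 4.0, 5.0]
condition_b = [1.0, 4.0, 9.0, 16.0, 25.0]
r_pearson = pearson(condition_a, condition_b)
r_spearman = spearman(condition_a, condition_b)
```

Because the second vector increases monotonically but not linearly with the first, the Spearman coefficient is exactly 1 while the Pearson coefficient is slightly lower.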


Fig. 8 The PCA-2D and PCA-3D parameters and plots. (a) The PCA-2D and PCA-3D parameters, (b) the
PCA-2D plot, and (c) the PCA-3D plot


Fig. 9 The Pearson correlation analysis. (a) The method selection section, (b) the correlation matrix, (c) the
correlation heatmap plot, and (d) the correlation plot


Pearson Correlation


To perform a Pearson correlation analysis, normalize the data as 
described above, and then go to the “Correlation” link in the 
“Transcriptome Analysis” menu. Then, in the “Method” section, 
choose “Pearson correlation” and then click “Plot” (Fig. 9a). The 
correlation is plotted as a heatmap or a correlation plot or displayed 
as a correlation matrix (Fig. 9b-d). Each of the outputs can be 
accessed through the corresponding tab. 


Fig. 10 The Spearman correlation analysis. (a) The method selection section, (b) the correlation matrix, (c) the
correlation heatmap plot, and (d) the correlation plot


Spearman Correlation


To perform the Spearman correlation analysis, normalize the data
as described above, and then go to the “Correlation” link in the
“Transcriptome Analysis” menu. Then, in the “Method” section,
choose “Spearman correlation” and then click “Plot” (Fig. 10a).
The rest of the analysis and the plot are similar to the Pearson
correlation (Fig. 10b-d).


Noise Analysis


GeneCloudOmics computes the transcriptome-wide average noise for
each replicate/condition to quantify the scatter in gene expression
between all replicates of one experimental condition. To perform
the noise analysis, normalize the data as described above, and then


Shannon Entropy


3.4.5 Differential Gene Expression Analysis


DE Analysis


Heatmap


go to the “Noise” link in the “Transcriptome Analysis” menu.
Then, select the “Anchor genotype,” which will be used for the
comparison, select the desired plot options, and then click “Plot”
(Fig. 11a). The noise can be plotted as a bar chart (Fig. 11b) or a
line chart (Fig. 11c).


To measure the disorder of the transcriptomic data as a
high-dimensional system, GeneCloudOmics computes the Shannon entropy
for each sample (condition or replicate). Higher Shannon
entropy values indicate increasing disorder. To perform the
Shannon entropy analysis, normalize the data as described above,
and then go to the “Entropy” link in the “Transcriptome Analysis”
menu. Then, select whether your data is time-series data and click
“Plot” (Fig. 12a). The Shannon entropy can be plotted as a bar
chart (Fig. 12b) or a line chart (Fig. 12c).


GeneCloudOmics provides an interface for three of the most used
DE analysis methods (EdgeR, DESeq2, and NOISeq). In this
tutorial, we will demonstrate how to perform DE analysis using
EdgeR since it supports the generation of all plots supported by
GeneCloudOmics, namely the volcano plot and the dispersion plot
(Fig. 13). The volcano plot shows the statistical significance
(p-value) in relation to the fold change in the gene expression, while
the dispersion plot quantifies the variance that deviates from the
mean.

To perform a DE analysis using GeneCloudOmics, normalize 
the data as described above, and then go to the “DE Analysis” link 
in the “Transcriptome Analysis” menu. Choose the DE analysis 
method from the provided list (here, we will choose “EdgeR”), 
and select the number of replicates in your data (single or multiple) 
(Fig. 13a). DE analysis is performed as a comparison between two 
conditions; hence, you need to choose the two conditions to be 
compared. Here, we are using the wild-type and the engineered 
cells. Finally, you need to provide the DE criteria, the false discov- 
ery rate (FDR), and the minimum fold change, and then click 
“Plot” (Fig. 13a). 

Once the execution is done, the list of DEGs, their statistical
significance, and their fold change will be available for download as a
CSV file. The volcano plot and the dispersion plot can be generated
by clicking the corresponding tabs (Fig. 13b, c). Both plots can be
downloaded in a PDF format.
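
How the two DE criteria act on a result table can be sketched as follows. The sketch assumes that the FDR is obtained with the Benjamini-Hochberg adjustment and that the minimum fold change is applied symmetrically to up- and down-regulated genes. The gene names, p-values, and fold changes are hypothetical, and the statistical test that produces the p-values (e.g., EdgeR) is not reproduced here.

```python
import math


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def select_degs(genes, log2_fold_changes, p_values, fdr=0.05, min_fold_change=2.0):
    """Keep genes that pass both the FDR and the fold change criteria."""
    adjusted = benjamini_hochberg(p_values)
    threshold = math.log2(min_fold_change)
    return [
        gene
        for gene, lfc, q in zip(genes, log2_fold_changes, adjusted)
        if q <= fdr and abs(lfc) >= threshold
    ]


# Hypothetical DE results for four genes.
genes = ["g1", "g2", "g3", "g4"]
log2_fc = [2.5, 0.3, -1.8, -3.0]
p_vals = [0.01, 0.04, 0.03, 0.005]
degs = select_degs(genes, log2_fc, p_vals)
```

In a volcano plot, the selected genes are the points that lie above the significance cutoff and outside the two vertical fold change cutoffs.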


Heatmaps represent the variance in gene expression using color 
intensity and help visualize clusters of genes with similar expression 
profiles. The map is a grid where each row is a gene and each 
column is a sample (condition or replicate). To perform the heat- 
map analysis in GeneCloudOmics, normalize the data as described 
above, and then go to the “Heatmap” link in the “Transcriptome 


Fig. 11 The noise analysis parameters and plots. (a) The noise analysis parameters, (b) the noise plotted as a
bar chart, (c) the noise plotted as a line chart


Fig. 12 The Shannon entropy analysis parameters and plots. (a) The Shannon entropy analysis parameters, (b)
the entropy plotted as a bar chart, (c) the entropy plotted as a line chart


Fig. 13 The DE analysis methods, parameters, and plots. (a) The DE analysis methods and parameters, (b) the
volcano plot, and (c) the dispersion plot. Both plots are only available for EdgeR and DESeq2 DE analysis
methods


Fig. 14 The heatmap parameters and plot. (a) The heatmap parameters and (b) the heatmap plot


Analysis” menu. Then select whether you want to create a heatmap plot
for all genes or for the DEGs only, choose the number of clusters, and then
click “Plot” (Fig. 14a). The heatmap plot will be displayed in the
“Heatmap” tab (Fig. 14b). The genes of each cluster can be
downloaded as a CSV file from the “Gene Clusters” tab.


Self-Organizing Map (SOM)


The SOM analysis is a dimensionality reduction approach that reduces
the complexity of the gene expression data. To perform the SOM
analysis in GeneCloudOmics, normalize the data as described
above, and then go to the “SOM” link in the “Transcriptome
Analysis” menu. Choose if you want to perform the analysis using
one sample or all samples. Then enter the number of horizontal and
vertical grids, the number of clusters, and the log transformation,
and then click “Plot” (Fig. 15a). The SOM analysis of GeneCloudOmics
provides five plots: (1) the property plot, (2) the count
plot, (3) the cluster plot, (4) the distance plot, and (5) the code plot
(Fig. 15b-f).


Fig. 15 The self-organizing map (SOM) analysis parameters and plots. (a) The SOM analysis parameters, (b)
the property plot, (c) the count plot, (d) the cluster plot, (e) the distance plot, and (f) the code plot


t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is another dimensionality reduction approach that reduces
the high-dimensional gene expression data to two or three dimensions.
To perform the t-SNE analysis in GeneCloudOmics, normalize the data
as described above, and then go to the “t-SNE” link in the
“Transcriptome Analysis” menu. Then enter the perplexity value, the
number of principal components (PCs), and the number of clusters.
Next, choose whether to log transform the data, and click “Plot”
(Fig. 16a). The t-SNE plot and the t-SNE table are shown in the
corresponding tabs (Fig. 16b, c).


3.4.6 Bioinformatics and Functional Annotations of the DEG

The DE analysis produces a list of DEGs that may explain the
difference in phenotype between samples (conditions or treatments).
To understand the biology behind this difference, this list of genes
needs to be annotated and interpreted. Most of the available DE
analysis tools either do not provide bioinformatics tools for gene
list functional analysis and interpretation or provide only basic
analysis such as GO annotation [28]. GeneCloudOmics provides access
to 11 different bioinformatics tools for the analysis of gene and
protein lists (Fig. 2). The GeneCloudOmics bioinformatics section is
designed to be used independently from the DE analysis and the
biostatistical sections. Thus, a gene or protein list that results
from any analysis can be analyzed using the GeneCloudOmics
bioinformatics section.

In this section of the tutorial, we will use the list of DEGs
resulting from the above analysis and perform several functional
annotations. For demonstration purposes, we will use the 50 most
significant DEGs from the list. The genes on the list use
Arabidopsis mRNA IDs, which are not supported by GeneCloudOmics.
Therefore, we used the ID converter of g:Profiler to convert them
to UniProt IDs.


Gene Ontology (GO) Association Analysis

The GO association analysis retrieves and sorts the GO terms
associated with the genes in the gene list (after ID conversion to
UniProt protein IDs) and creates three GO plots. To perform
GO association analysis, go to the “Protein Set Analysis” in the
main menu, click “Gene Ontology,” and then upload the list of
UniProt IDs as a CSV file or, alternatively, paste the list of IDs
in the designated text box (Fig. 17a). The paste option supports
different types of delimiters including space, tab, comma, and new
line. Therefore, the IDs can be directly copied from spreadsheet
software (e.g., Microsoft Excel or Google Sheets) or other media
such as text files. Once the upload/paste is complete, click the
“Submit” button. GeneCloudOmics connects to UniProt, downloads the
GO associations, and then creates the GO terms biological process
plot (Fig. 17b), the GO terms molecular function plot (Fig. 17c),
and the GO terms cellular component plot (Fig. 17d). The
results can also be downloaded as CSV files for further analysis or
to be imported to another tool.
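
A list copied from a spreadsheet or text file can be checked before pasting it into the text box. The sketch below splits a pasted string on the delimiters named above (space, tab, comma, and new line). It is a stand-alone illustration, not code from GeneCloudOmics; the function name parse_id_list and the choice to drop duplicates are assumptions, and the accessions are used only as sample strings.

```python
import re


def parse_id_list(text):
    """Split pasted text on spaces, tabs, commas, or new lines.

    Empty tokens are dropped; duplicates are removed while keeping the
    first occurrence (an assumption made for this sketch).
    """
    seen = set()
    ids = []
    for token in re.split(r"[\s,]+", text.strip()):
        if token and token not in seen:
            seen.add(token)
            ids.append(token)
    return ids


# Mixed delimiters, as might result from copying cells and typing by hand.
pasted = "Q940V4, O64989\tQ38967\nQ940V4  O48766"
ids = parse_id_list(pasted)
print(len(ids), "unique accessions:", ids)
```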




[Fig. 16 panels: (a) parameter form with “Perplexity value,” “No. of PCs,” “Transformation” (None or log10), “Kmeans clustering on columns,” “Number of clusters” (4), and “Display sample name”; (b) the “t-SNE plot” tab; (c) the “t-SNE table” tab listing TSNE1, TSNE2, Sample, and cluster for the ten samples (CM1, CM2, CR1, CR2, RM1, RM2, RR1, RR2, RB1, RB2)]


Fig. 16 The t-distributed stochastic neighbor embedding (t-SNE) analysis
parameters and outputs. (a) The t-SNE analysis parameters, (b) the t-SNE plot, and (c) the t-SNE table




[Fig. 17 panels: (a) input form with “Upload UniProt accession CSV file” and “Enter UniProt accession” options; (b) “Biological Process,” (c) “Molecular Function,” and (d) “Cellular component” plots, each with a “Protein count” axis]


Fig. 17 The gene ontology (GO) association analysis parameters and outputs. (a) The GO association analysis
parameters, (b) the biological processes plot, (c) the molecular functions plot, and (d) the cellular
component plot




Pathway Enrichment Analysis

Pathway enrichment analysis identifies the pathways that are
enriched in a list of genes or proteins. This helps in understanding
the cellular pathways and biological functions affected by the
differential expression of those genes. GeneCloudOmics allows
performing this analysis using gene names or protein UniProt
IDs. To perform pathway enrichment analysis using gene names
or protein IDs, go to the “Gene Set Analysis” or “Protein Set
Analysis,” respectively, click “Pathway Enrichment,” and then
upload or paste the list of gene names or UniProt IDs as described
above. Then select the plot style and layout, and choose the
minimum overlap between the query and the pathways (the minimum
number of query genes that must be in the pathway) (Fig. 18a). The
pathway enrichment is performed using g:Profiler [46], and the
results are plotted as an enrichment plot (Fig. 18b) and can be
downloaded in CSV format.


Protein-Protein Interactions

GeneCloudOmics enables downloading and plotting all protein-protein
interactions (PPIs) associated with the given set of proteins
(UniProt IDs). The PPIs are visualized as a network using
Cytoscape.js [47] and can also be downloaded as an interaction
table. To perform PPI analysis, go to the “Protein Set Analysis” in
the main menu, click “Protein Interactions,” and then upload or
paste the list of UniProt IDs as described above (Fig. 19a).
GeneCloudOmics connects to UniProt and downloads the PPIs associated
with each of the provided proteins and then plots the interactions
as a network (Fig. 19b). The style of the network (node and edge
appearance and colors) can be changed from the “Select Style” menu,
and the network layout can be changed from the “Select Layout” menu
(Fig. 19a). GeneCloudOmics provides five and ten different network
styles and layouts, respectively.


Protein Functions and Subcellular Localization

Protein functions and subcellular localizations can provide useful
information on the set of proteins under investigation.
GeneCloudOmics provides two tools to access this information from
the UniProt Knowledgebase. To get the functions or the subcellular
locations of your proteins, go to the “Protein Set Analysis” in the
main menu, and click “Protein Functions” or “Subcellular
Localization,” respectively. Next, upload or paste the list of
UniProt IDs as described above. GeneCloudOmics connects to UniProt,
downloads the functional and localization annotations associated
with each of the provided proteins, and displays them in tabular
view (Fig. 20). The results can be downloaded in CSV format as well.


Protein and Gene Properties

GeneCloudOmics also provides access to tools that investigate
different properties of the proteins including protein
physicochemical properties, sequence properties, and evolutionary
properties (Helmy 2021). To investigate the physicochemical
properties of your proteins, go to the “Protein Set Analysis” in the
main menu,




[Fig. 18 panels: (a) input form with “Upload genes CSV file,” “Enter gene id,” “Select Style” (generic style), “Minimum Overlap” slider, and “Select Layout”; (b) enrichment plot with a “KEGG stats” table (columns: Term name, Term ID, adjusted p-value, T, Q, T∩Q, U) listing Brassinosteroid biosynthesis (KEGG:00905), Biosynthesis of secondary metabolites (KEGG:01110), Phenylpropanoid biosynthesis (KEGG:00940), and Metabolic pathways (KEGG:01100)]


Fig. 18 The pathway enrichment analysis parameters and outputs. (a) The pathway enrichment analysis 
parameters and (b) the pathway enrichment plot 




[Fig. 19 panels: (a) input form with “Upload UniProt accession CSV file,” “Enter UniProt accession numbers,” “Select Style,” and “Select Layout”; (b) “Protein-Protein Interactions” output with “Visualization,” “Protein Interaction,” and “Protein Name” tabs, showing the interaction network]


Fig. 19 The protein-protein interaction (PPI) analysis parameters and outputs. (a) The PPI analysis input and (b)
the PPIs visualized as a network with the circle layout




[Fig. 20 panels: a “Function” table listing each UniProt accession with its UniProt FUNCTION annotation (e.g., Q940V4: catalyzes the C6-oxidation step in brassinosteroid biosynthesis) and a “Subcellular.Location” table listing the SUBCELLULAR LOCATION annotation of each protein; each table shows 10 of 51 entries per page]


Fig. 20 The protein functions and subcellular localization outputs. (a) The protein function table and (b) the
subcellular localization table




click “Protein Properties,” then upload or paste the list of UniProt
IDs as described above, and click “Submit.” This feature performs
three different protein property analyses: (1) protein sequence
charge, (2) protein sequence acidity and GRAVY index, and
(3) protein hydrophobicity. Each of them can be accessed through the
corresponding tab. In addition, the “All physicochemical properties”
tab provides a combined analysis of all of them (Fig. 21a).

The evolutionary analysis provided by GeneCloudOmics includes the
protein’s gene tree, chromosomal location, and phylogenetic tree.
To access the protein evolutionary analysis tools, go to the
“Protein Set Analysis” in the main menu, click “Evolutionary
Analysis,” then upload or paste the list of UniProt IDs as
described above, and click “Submit.” Each of these analyses can be
accessed through the corresponding tab. Here, we show the output
of the phylogenetic tree analysis, which performs multiple sequence
alignment (MSA) and then creates the phylogenetic tree (Fig. 21b).

In all the gene and protein set analyses, you do not need to
upload your gene or protein list every time you use a new analysis.
If you are to use the same list in multiple analyses, upload the
list through the “Upload a Protein List” in the “Protein Set
Analysis” menu. The uploaded list will be kept until you finish your
analysis and close the session. When moving from one analysis to
another, your protein list will always be pasted in the text box and
ready for the next analysis.


3.4.7 Result Interpretation

In this protocol, we used transcriptomic data from wild-type and 
engineered Arabidopsis cells with ten samples of five different con- 
ditions consisting of wild-type that is untreated (M) or treated with 
rapamycin (Rap) and engineered cells that are untreated (M) or 
treated with rapamycin (Rap) or brassinolide (BL) with two repli- 
cates per condition [32]. We demonstrated how GeneCloudOmics 
can be used to perform transcriptomic data analysis for synthetic 
biology research by preprocessing the data obtained from the GEO 
database, performing several biostatistical tests, identifying the
differentially expressed genes (DEGs) between two different
conditions, and analyzing the list of the DEGs using multiple
bioinformatics tools. For sample-level analysis, we chose two
samples: (1) the untreated wild type (CM1) and (2) the 
rapamycin-treated engineered cells (RR1). 

First, inspection of the raw data suggested that the data had
already been partially preprocessed, as the box plots of the raw
data and the TPM-normalized data show similar profiles (Fig. 4b).
However, the violin plots of the same two data sets show fewer
outliers and a better distribution for the normalized data
(Fig. 4c). To determine the
expression threshold for low-count filtering, we used the 
transcriptome-wide distribution fitting for each of the two selected 
samples. The transcriptome-wide distribution fitting with six 


[Fig. 21 panels: (a) “Sequence Charge” (Negative vs. Positive groups), “Sequence Acidity” (Acidic vs. Basic groups), “Hydrophobicity,” and “Sequence GRAVY index” (Negative vs. Positive groups) plots; (b) “Phylogenetic Tree of proteins”]


Fig. 21 The protein properties and evolutionary analysis outputs. (a) All the protein physicochemical property 
plot and (b) the protein phylogenetic tree plot 




different statistical distribution models shows a cutoff of five since 
the larger counts follow all the six statistical distributions 
(Fig. 6b, c). 

Next, we used several biostatistical tests to investigate the 
global relationship between samples, either all samples or pairwise. 
The width of the scatter plot can be used to visualize the variation 
between two samples and to show the number of toggle genes
[Sandro 2022]. For the two selected samples, the width of the 
scatter plot shows variability in the expression of several genes, 
and the genes on the two axes are the toggle genes (Fig. 5b). The 
PCA variance plot shows PC1 and PC2 as the main components 
(Fig. 7b). The 2D-PCA clustered all the samples into two clusters 
and shows considerable variations between replicates (Fig. 8b), 
while the 3D-PCA shows less variation between replicates and the 
brassinolide-treated engineered cells (RB1) as the most variable 
sample (Fig. 8c). 

The Pearson correlation and Spearman correlation analyses 
show the high correlation between replicates of the same condition, 
illustrated by the correlation heatmaps (Figs. 9c and 10c) and the 
correlation plots (Figs. 9d and 10d). Furthermore, the wild-type 
samples (treated and untreated) show a considerable correlation 
with the untreated engineered cells but not with the treated engi- 
neered cells. The noise analysis shows a noise level that is around 
0.3 for all samples with a significant increase for the rapamycin- 
treated engineered cells (RR1 and RR2) (Fig. 11b, c). While all the
samples have a Shannon entropy of 0.6 on average, the same two 
samples of the rapamycin-treated engineered cells show an elevated 
Shannon entropy of 0.1 and 0.14, respectively (Fig. 12b, c). 
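
The two correlation measures used above can be sketched in a few lines of plain Python. This is an illustrative calculation only and not the GeneCloudOmics code; the function names and the toy expression vectors are assumptions. Spearman correlation is computed here as the Pearson correlation of the ranks, with tied values given their average rank.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up expression values for four genes in two samples.
sample_a = [1.0, 2.0, 3.0, 4.0]
sample_b = [1.0, 4.0, 9.0, 16.0]
print(round(pearson(sample_a, sample_b), 4), round(spearman(sample_a, sample_b), 4))
```

Because sample_b is a monotonic but non-linear function of sample_a, the rank-based Spearman value is higher than the Pearson value, which is the usual reason for inspecting both.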

The differential expression (DE) analysis between the two 
selected samples was performed using the EdgeR method with a
minimum twofold expression change threshold and an FDR of
0.05. The analysis resulted in 3325 DE genes. The results of the
DE analysis were visualized using the volcano plot, which shows
the relationship of the p-value and the expression fold difference for 
every gene (Fig. 13b) and the estimation of gene-wise dispersion 
and empirical shrinkage of these estimates to produce a more 
accurate dispersion estimate for actual gene count modeling 
(Fig. 13c). All the DEGs were then analyzed by the heatmap gene 
clustering feature to identify groups of genes with similar patterns 
of gene expression change between samples (Fig. 14b). The analysis 
shows similar global expression patterns in the treated and 
untreated wild-type cells and the untreated engineered cells (CM, 
CR, and RM), while the rapamycin-treated engineered cells (RR1 
and RR2) and the brassinolide-treated engineered cells (RB1 and 
RB2) show two significantly different patterns (Fig. 14b).
Gene-wise, four common expression patterns were observed: (1) genes
with decreased expression in the CM, CR, and RM samples and
increased expression in the other two conditions (RR and RB)
(Group 1); (2) genes with decreased expression in the CM, CR, and RM
samples, increased expression in the RB samples, and highly
increased expression in the RR samples (Group 2); (3) genes with
decreased expression in the CM, CR, and RM samples, strongly
decreased expression in the RR samples, and increased expression
in the RB samples (Group 3); and (4) genes with increased
expression in the CM, CR, and RM samples, decreased expression in
the RB samples, and strongly decreased expression in the RR
samples (Group 4) (Fig. 14b).
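
The two selection criteria mentioned above, a minimum fold change and an FDR cutoff, can be illustrated as follows. The sketch assumes that per-gene p-values and fold changes are already available (EdgeR derives them from its own negative binomial model, which is not reproduced here) and applies the Benjamini-Hochberg adjustment. The function names, gene labels, and numbers are hypothetical.

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never decrease with increasing raw p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_degs(genes, fold_changes, pvalues, min_fold=2.0, fdr=0.05):
    """Keep genes with |log2 fold change| >= log2(min_fold) and adjusted p <= fdr.

    fold_changes are plain ratios (must be positive), not log values.
    """
    adjusted = benjamini_hochberg(pvalues)
    cutoff = math.log2(min_fold)
    return [g for g, fc, q in zip(genes, fold_changes, adjusted)
            if abs(math.log2(fc)) >= cutoff and q <= fdr]


genes = ["geneA", "geneB", "geneC", "geneD"]   # hypothetical labels
fold_changes = [4.0, 0.25, 1.5, 8.0]           # made-up expression ratios
pvalues = [0.01, 0.04, 0.03, 0.5]              # made-up raw p-values
print(select_degs(genes, fold_changes, pvalues))
```

In this toy case only geneA passes both filters: geneB has a fourfold decrease, but its adjusted p-value (about 0.053) is just above the 0.05 cutoff.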

The list of DEGs was then analyzed using GeneCloudOmics 
bioinformatics features. Since GeneCloudOmics supports gene 
names and UniProt IDs only and the gene list was in Arabidopsis 
transcript IDs, the ID converter of g:Profiler was used to convert
the Arabidopsis transcript IDs to UniProt IDs. 

The gene ontology (GO) biological process analysis of the top 
50 DEGs shows the enrichment of several brassinosteroid-related 
processes and signaling pathways as well as several metabolic pro- 
cesses related to the cellular response to different stimuli (Fig. 17b). 
The GO molecular function analysis for the same genes shows 
enrichment with functions related to metal ion binding and different
hormonal activities (Fig. 17c). The GO cellular component analysis
shows that most of the top 50 proteins are from the membrane,
extracellular, or cell wall areas (Fig. 17d). The protein-protein
interactions (PPI) feature in GeneCloudOmics was used to obtain
the PPIs associated with the same set of proteins. The analysis
shows that only a few of those proteins are associated with known
interactions (Fig. 19b). For pathway enrichment analysis, we used
the top 100 genes, which show enrichment of the brassinosteroid and
phenylpropanoid biosynthesis pathways and the biosynthesis of
secondary metabolites (Fig. 18b).
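
The idea behind an enrichment result such as the one above can be illustrated with a one-sided hypergeometric (over-representation) test combined with a minimum-overlap filter. The sketch below is a simplified stand-in and not the g:Profiler procedure, which additionally applies its own multiple-testing correction; the function names, pathway labels, and gene labels are hypothetical. The four arguments correspond to the universe size (U), the pathway size (T), the query size (Q), and their intersection.

```python
from math import comb


def hypergeom_enrichment_p(universe, term_size, query_size, overlap):
    """P(X >= overlap) when query_size genes are drawn from a universe
    that contains term_size pathway members (uncorrected p-value)."""
    upper = min(term_size, query_size)
    total = comb(universe, query_size)
    tail = sum(comb(term_size, k) * comb(universe - term_size, query_size - k)
               for k in range(overlap, upper + 1))
    return tail / total


def enriched_terms(term_to_genes, query, universe_size, min_overlap=2):
    """Test every pathway sharing at least min_overlap genes with the query;
    return (term, overlap, p-value) tuples sorted by p-value."""
    query = set(query)
    rows = []
    for term, members in term_to_genes.items():
        hits = query & set(members)
        if len(hits) < min_overlap:
            continue  # pathways below the minimum overlap are not reported
        p = hypergeom_enrichment_p(universe_size, len(set(members)),
                                   len(query), len(hits))
        rows.append((term, len(hits), p))
    return sorted(rows, key=lambda row: row[2])


# Toy annotation: two made-up pathways in a universe of ten genes.
pathways = {"pathway_A": ["g1", "g2", "g3", "g4"], "pathway_B": ["g5", "g9"]}
print(enriched_terms(pathways, ["g1", "g2", "g7"], universe_size=10))
```

Raising min_overlap removes pathways supported by only one or two query genes, which is the role of the “Minimum Overlap” setting described earlier.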

Investigating the proteins’ physicochemical properties shows
that, among the proteins that correspond to the top 50 DEGs, more
carry a negative charge than a positive charge. In terms of
acidity, they are balanced, with almost half of the protein
residues being acidic and half being basic (Fig. 21a). The
GRAVY index, which indicates the hydropathy of the proteins,
shows that 84% of the proteins have a negative value (hydrophilic)
(Fig. 21a). The evolutionary analysis feature presented in Gene- 
CloudOmics is demonstrated by the creation of a phylogenetic tree 
for the top 50 proteins. GeneCloudOmics downloaded the 
sequences, performed MSA, calculated the phylogenetic relation- 
ships between the proteins, and then created and plotted the tree 
(Fig. 21b). 
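
For readers unfamiliar with the GRAVY index, the sketch below shows how it is conventionally computed: the mean Kyte-Doolittle hydropathy value over all residues of a sequence, where a positive mean indicates a hydrophobic protein and a negative mean a hydrophilic one. The code is an illustration and not the GeneCloudOmics implementation; the function names and test peptides are hypothetical, and the charge estimate is a deliberately crude residue count that ignores histidine and pH.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue.

    Raises KeyError for non-standard residue letters.
    """
    seq = sequence.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def hydropathy_label(score):
    """Positive GRAVY -> hydrophobic; zero or negative -> hydrophilic."""
    return "hydrophobic" if score > 0 else "hydrophilic"


def net_charge_estimate(sequence):
    """Crude net charge: (K + R) minus (D + E) residue counts."""
    seq = sequence.upper()
    return seq.count("K") + seq.count("R") - seq.count("D") - seq.count("E")


for peptide in ["ILVF", "KDEN"]:  # made-up test peptides
    score = gravy(peptide)
    print(peptide, round(score, 2), hydropathy_label(score),
          net_charge_estimate(peptide))
```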




4 Notes 


1. The metadata file is a required file to run any transcriptomic
data analysis. It is not usually included with the data. It can be
easily created in a spreadsheet or a text editor as a table with two
columns, where the first column lists the sample names, exactly
as in the read count file, while the other column lists the
relationship between them (time points or treatments).


2. The entries in the second column in the meta-file must start
with the small letter “t.”


3. The gene length file is required for certain normalization
methods (see above). This file lists the gene (or transcript) IDs and
the length of each gene. This file needs to be created if it is not
included in the source of the data.


4. To create the gene length file, download the genome annotation
of the organism from the Ensembl database or other
specialized databases. The genome annotation is usually in
the gene-finding format or the generic feature format (GFF).
The GFF format lists the genome annotation features in a
tabular format of nine columns. Columns 1, 3, 4, and 5 are
the important columns to create the gene length file. Open the
file using spreadsheet software such as Microsoft Excel and
filter column 3 (the feature column), selecting the features
annotated as “mRNA.” Then, add a column called “Length”
whose contents are equal to column 5 minus column 4 (the
mRNA end minus the mRNA start; GFF coordinates are 1-based
and inclusive, so add 1 if the exact length is needed). Then copy
column 1 (the sequence column) and the new “Length” column to a
new sheet, and rename column 1 to “Gene.” You will have a table
with two columns, gene and length.


5. GFF files are usually large files with thousands of rows. In
many cases, the spreadsheet method might not work
since the file size will be bigger than what Excel can handle.
In such cases, programming languages such as R and
Python are needed to process the GFF file and create
the gene length file.


6. GeneCloudOmics is a webserver that performs all the analyses
on demand, and it does not store any data. The data size and
the query length play an important role in the time that the
analysis will take, especially in the analyses that GeneCloudOmics
performs through connecting to other sources, such as the
UniProt Knowledgebase. Therefore, a very long gene list will
initiate a very long query, which can take longer than the
session expiry time. If you encounter such a
situation, reduce the number of your genes/proteins by running
the analysis multiple times, each time with a subset of
the genes/proteins.
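
The file-preparation steps in Notes 1, 2, 4, 5, and 6 can also be scripted. The following Python sketch is one possible approach under stated assumptions, not a tool provided with GeneCloudOmics: all function names and example values are hypothetical. In standard GFF3 the first column holds the sequence (e.g., chromosome) identifier, so the sketch reads the transcript ID from the ID attribute in column 9 and falls back to column 1 only when no ID is present. It also adds 1 to end minus start because GFF coordinates are 1-based and inclusive; drop the "+ 1" to reproduce the spreadsheet formula of Note 4 exactly.

```python
def check_metadata(rows, count_columns):
    """Return a list of problems found in (sample, group) metadata rows.

    count_columns: sample names from the header of the read count file.
    """
    problems = []
    for sample, group in rows:
        if sample not in count_columns:
            problems.append(f"{sample}: not a column of the read count file")
        if not group.startswith("t"):
            problems.append(f"{sample}: '{group}' does not start with 't'")
    listed = {sample for sample, _ in rows}
    for sample in count_columns:
        if sample not in listed:
            problems.append(f"{sample}: missing from the metadata file")
    return problems


def gene_lengths_from_gff(lines, feature="mRNA"):
    """Map feature ID -> length for GFF rows whose column 3 equals `feature`."""
    lengths = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature:
            continue
        start, end = int(cols[3]), int(cols[4])
        attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
        gene_id = attrs.get("ID", cols[0])  # assumption: ID attribute is present
        lengths[gene_id] = end - start + 1   # 1-based, inclusive coordinates
    return lengths


def batches(ids, size):
    """Split a long ID list into consecutive subsets of at most `size` IDs."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


# Small made-up examples for each helper.
meta = [("CM1", "t1"), ("CM2", "t1"), ("RR1", "x2")]
print(check_metadata(meta, ["CM1", "CM2", "RR1", "RR2"]))

gff = [
    "##gff-version 3",
    "1\tsrc\tgene\t100\t500\t.\t+\t.\tID=gene:G1",
    "1\tsrc\tmRNA\t100\t500\t.\t+\t.\tID=transcript:G1.1;Parent=gene:G1",
]
print(gene_lengths_from_gff(gff))
print(batches(["P1", "P2", "P3", "P4", "P5"], 2))
```

For a real annotation, the lines can be streamed from the file with open(path) so that the whole GFF never has to fit in a spreadsheet, and the resulting dictionary can be written out as the two-column gene length table.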


5 Conclusions

Gene expression (transcriptome) analysis is a key technique that
provides a mechanistic understanding of biological systems. It has
wide applications in synthetic biology and metabolic engineering.
Performing the differential expression analysis requires an adequate
background in bioinformatics, biostatistics, and computer
programming, which makes it challenging for bench scientists to
exploit the full potential of this technique. We developed
GeneCloudOmics, a cloud-based server that performs transcriptome
analysis through a graphical user interface (GUI) without the
need for special computational or programming skills.
GeneCloudOmics enables the user to perform 23 different data
analytics and bioinformatics tasks including differential gene
expression analysis (DGE) using 3 different methods and the
functional interpretation and analysis of the differentially
expressed (DE) genes. In this tutorial, we used RNA-Seq data from
wild-type and engineered Arabidopsis cells to demonstrate how
GeneCloudOmics can be used to perform transcriptomic data analysis
in synthetic biology applications.


Acknowledgment

The authors thank Olga Sirbu (Bioinformatics Institute, A*STAR)
for the technical support with the data analysis.


References

1. Dasgupta A, Chowdhury N, De RK (2020) Metabolic pathway
engineering: perspectives and applications. Comput Methods Prog
Biomed 192:105436. https://doi.org/10.1016/J.CMPB.2020.105436

2. Erb TJ, Jones PR, Bar-Even A (2017) Synthetic metabolism:
metabolic engineering meets enzyme design. Curr Opin Chem Biol
37:56-62. https://doi.org/10.1016/J.CBPA.2016.12.023

3. McCarty NS, Ledesma-Amaro R (2019) Synthetic biology tools to
engineer microbial communities for biotechnology. Trends Biotechnol
37:181-197. https://doi.org/10.1016/J.TIBTECH.2018.11.002

4. Tinafar A, Jaenes K, Pardee K (2019) Synthetic biology goes
cell-free. BMC Biol 17:1-14.
https://doi.org/10.1186/S12915-019-0685-X

5. Soliman S, El-Keblawy A, Mosa KA et al (2018) Understanding the
phytohormones biosynthetic pathways for developing engineered
environmental stress-tolerant crops. In: Biotechnologies of crop
improvement, vol 2. Springer, Cham, pp 417-450.
https://doi.org/10.1007/978-3-319-90650-8_15

6. Mosa KA, Saadoun I, Kumar K et al (2016) Potential
biotechnological strategies for the cleanup of heavy metals and
metalloids. Front Plant Sci 7:303

7. Mao N, Cubillos-Ruiz A, Cameron DE, Collins JJ (2018) Probiotic
strains detect and suppress cholera in mice. Sci Transl Med 10:2586.
https://doi.org/10.1126/SCITRANSLMED.AAO2586/SUPPL_FILE/AAO2586_TABLE_S1.ZIP

8. Siciliano V, Diandreth B, Monel B et al (2018) Engineering
modular intracellular protein sensor-actuator devices. Nat Commun
9(1):1-7. https://doi.org/10.1038/s41467-018-03984-5

9. Fossati E, Ekins A, Narcross L et al (2014) Reconstitution of a 10-gene pathway for
synthesis of the plant alkaloid dihydrosanguinarine in Saccharomyces
cerevisiae. Nat Commun 51(5):1-11.
https://doi.org/10.1038/ncomms4283

10. Nielsen J, Keasling JD (2016) Engineering cellular metabolism.
Cell 164:1185-1197. https://doi.org/10.1016/J.CELL.2016.02.004

11. Wagner TE, Becraft JR, Bodner K et al (2018)
Small-molecule-based regulation of RNA-delivered circuits in
mammalian cells. Nat Chem Biol 1411(14):1043-1050.
https://doi.org/10.1038/s41589-018-0146-9

12. Cho JH, Collins JJ, Wong WW (2018) Universal chimeric antigen
receptors for multiplexed and logical control of T cell responses.
Cell 173:1426-1438.e11. https://doi.org/10.1016/J.CELL.2018.03.038

13. Kulkarni R (2016) Metabolic engineering: biological art of
producing useful chemicals. Indian Acad Sci 21:233-237

14. Garcia-Granados R, Lerma-Escalera JA, Morones-Ramirez JR (2019)
Metabolic engineering and synthetic biology: synergies, future,
and challenges. Front Bioeng Biotechnol 7:36.
https://doi.org/10.3389/fbioe.2019.00036

15. Comba S, Arabolaza A, Gramajo H (2012) Emerging engineering
principles for yield improvement in microbial cell design. Comput
Struct Biotechnol J 3:e201210016

16. Helmy M, Smith D, Selvarajoo K (2020) Systems biology approaches
integrated with artificial intelligence for optimized metabolic
engineering. Elsevier B.V.

17. Cheng JK, Alper HS (2016) Transcriptomics-guided design of
synthetic promoters for a mammalian system. ACS Synth Biol
5:1455-1465.
https://doi.org/10.1021/ACSSYNBIO.6B00075/SUPPL_FILE/SB6B00075_SI_001.PDF

18. El-Metwally S, Ouda OM, Helmy M (2014) Next generation
sequencing technologies and challenges in sequence assembly, 1st
edn. Springer, New York

19. El Karoui M, Hoyos-Flight M, Fletcher L (2019) Future trends in
synthetic biology—a report. Front Bioeng Biotechnol 7:175.
https://doi.org/10.3389/FBIOE.2019.00175/BIBTEX

20. Khalil AS, Collins JJ (2010) Synthetic biology: applications
come of age. Nat Rev Genet 115(11):367-379.
https://doi.org/10.1038/nrg2775

21. Brooks SM, Alper HS (2021) Applications, challenges, and needs
for employing synthetic biology beyond the lab. Nat Commun
121(12):1-16. https://doi.org/10.1038/s41467-021-21740-0

22. Flores Bueso Y, Tangney M (2017) Synthetic biology in the
driving seat of the bioeconomy. Trends Biotechnol 35:373-378.
https://doi.org/10.1016/J.TIBTECH.2017.02.002

23. Kwon SW, Paari KA, Malaviya A, Jang YS (2020) Synthetic biology
tools for genome and transcriptome engineering of solventogenic
clostridium. Front Bioeng Biotechnol 8:282.
https://doi.org/10.3389/FBIOE.2020.00282

24. Hrdlickova R, Toloue M, Tian B (2017) RNA-Seq methods for
transcriptome analysis. Wiley Interdiscip Rev RNA 8(1):e1364.
https://doi.org/10.1002/WRNA.1364

25. Niazian M (2019) Application of genetics and biotechnology for
improving medicinal plants. Planta 249:953-973.
https://doi.org/10.1007/S00425-019-03099-1

26. Wright RC, Nemhauser J (2019) Plant Synthetic Biology:
Quantifying the “Known Unknowns” and Discovering the “Unknown
Unknowns.” Plant Physiol 179:885.
https://doi.org/10.1104/PP.18.01222

27. Poliner E, Farré EM, Benning C (2018) Advanced genetic tools
enable synthetic biology in the oleaginous microalgae
Nannochloropsis sp. Plant Cell Rep 37:1383-1399.
https://doi.org/10.1007/S00299-018-2270-0

28. Helmy M, Agrawal R, Ali J et al (2021) GeneCloudOmics: a data
analytic cloud platform for high-throughput gene expression
analysis. Front Bioinf 2021:63.
https://doi.org/10.3389/FBINF.2021.693836

29. McDermaid A, Monier B, Zhao J et al (2019) Interpretation of
differential gene expression results of RNA-seq data: review and
integration. Brief Bioinform 20:2044-2054

30. Mantione KJ, Kream RM, Kuzelova H et al (2014) Comparing
bioinformatic gene expression profiling methods: microarray and
RNA-Seq. Med Sci Monit Basic Res 20:138-142.
https://doi.org/10.12659/MSMBR.892101

31. Zou Y, Bui TT, Selvarajoo K (2019) ABioTrans: a biostatistical
tool for transcriptomics analysis. Front Genet 10:499.
https://doi.org/10.3389/fgene.2019.00499

32. Kim S, Park J, Jeon BW et al (2021) Chemical control of receptor kinase signaling by
33. 


34. 


35. 


36. 


37. 


38. 


39. 


40. 


41. 


42. 


43. 


44. 


Transcriptomics Data Analytics for Synthetic Biology 


rapamycin-induced dimerization. Mol Plant 
14:1379-1390. https://doi.org/10.1016/J. 
MOLP.2021.05.006 


33. Schultheiss SJ (2011) Ten simple rules for providing a scientific web resource. PLoS Comput Biol 7:e1001126
34. Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15:550. https://doi.org/10.1186/s13059-014-0550-8
35. Tarazona S, Furió-Tarí P, Turra D et al (2015) Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. Nucleic Acids Res 43:e140. https://doi.org/10.1093/nar/gkv711
36. Robinson MD, McCarthy DJ, Smyth GK (2009) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26:139-140. https://doi.org/10.1093/bioinformatics/btp616
37. Raychaudhuri S, Stuart JM, Altman RB (2000) Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput 2000:455-466. https://doi.org/10.1142/9789814447331_0043
38. Piras V, Tomita M, Selvarajoo K (2014) Transcriptome-wide variability in single embryonic development cells. Sci Rep 4:1-9. https://doi.org/10.1038/srep07137
39. Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379-423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
40. Breiman L (1996) Bagging predictors. Mach Learn 24:123-140
41. Cutler A, Cutler DR, Stevens JR (2012) Random forests. In: Ensemble machine learning. Springer, Boston, pp 157-175. https://doi.org/10.1007/978-1-4419-9326-7_5
42. Villmann T, Bauer HU (1998) Applications of the growing self-organizing map. Neurocomputing 21:91-100. https://doi.org/10.1016/S0925-2312(98)00037-X
43. Cieslak MC, Castelfranco AM, Roncalli V et al (2020) t-Distributed Stochastic Neighbor Embedding (t-SNE): a tool for eco-physiological transcriptomic analysis. Mar Genomics 51:100723. https://doi.org/10.1016/j.margen.2019.100723
44. Bateman A (2019) UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47:D506-D515. https://doi.org/10.1093/nar/gky1049

45. Reimand J, Isserlin R, Voisin V et al (2019) Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap. Nat Protoc 14(2):482-517. https://doi.org/10.1038/s41596-018-0103-9
46. Raudvere U, Kolberg L, Kuzmin I et al (2019) G:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Res 47:W191-W198. https://doi.org/10.1093/nar/gkz369
47. Franz M, Lopes CT, Huck G et al (2015) Cytoscape.js: a graph theory library for visualisation and analysis. Bioinformatics 32:btv557. https://doi.org/10.1093/bioinformatics/btv557
48. Giurgiu M, Reinhard J, Brauner B et al (2019) CORUM: the comprehensive resource of mammalian protein complexes - 2019. Nucleic Acids Res 47:D559-D563. https://doi.org/10.1093/nar/gky973
49. Vella D, Zoppis I, Mauri G et al (2017) From protein-protein interactions to protein co-expression networks: a new perspective to evaluate large-scale proteomic data. EURASIP J Bioinforma Syst Biol 2017:6
50. Franz M, Rodriguez H, Lopes C et al (2018) GeneMANIA update 2018. Nucleic Acids Res 46:W60-W64. https://doi.org/10.1093/nar/gky311
51. Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol 157:105-132. https://doi.org/10.1016/0022-2836(82)90515-0
52. Amberger JS, Bocchini CA, Scott AF, Hamosh A (2019) OMIM.org: leveraging knowledge across phenotype-gene relationships. Nucleic Acids Res 47:D1038-D1043. https://doi.org/10.1093/nar/gky1151
53. Hatos A, Hajdu-Soltész B, Monzon AM et al (2020) DisProt: intrinsic protein disorder annotation in 2020. Nucleic Acids Res 48:D269-D276. https://doi.org/10.1093/nar/gkz975
54. Piñero J, Ramírez-Anguita JM, Saüch-Pitarch J et al (2020) The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Res 48:D845-D855. https://doi.org/10.1093/nar/gkz1021

Chapter 13


Overview of Bioinformatics Software and Databases 
for Metabolic Engineering 


Deena M. A. Gendoo 


Abstract 


The explosion of the “omics” era has introduced a growing number of datasets and tools that facilitate
molecular interrogation of the metabolome. These include various bioinformatics and pharmacogenomics
resources that can be utilized independently or collectively to facilitate metabolic engineering across disease,
clinical oncology, and understanding of molecular changes across larger systems. This review provides
starting points for accessing publicly available data and computational tools that support assessment of
metabolic profiles and metabolic regulation, providing both a depth-and-breadth approach toward
understanding the metabolome. We focus in particular on pathway databases and tools, which provide in-depth
analysis of the metabolic pathways that are at the heart of metabolic engineering.


Key words Bioinformatics, Metabolomics, Pharmacogenomics, Software, Databases, Metabolic engineering,
High throughput, Omics


1 Introduction


One can view metabolic engineering as the integration and synergy
of two main components: (1) investigating the metabolome, which
focuses predominantly on understanding the role of substrates in
influencing a biological (metabolic) process or pathway, and
(2) engineering the metabolome, which proposes new strategies to
optimize and target metabolic networks and cellular processes. The
engineering aspect can encompass varied tactics, including
reconstruction of metabolic networks, protein or nucleic acid
engineering, and analysis and manipulation of metabolic fluxes [1].
The ultimate goal is the optimization of cellular processes and
metabolic pathways, for the purpose of producing desired and
cost-effective chemical compounds at optimal conditions within a
specific organism [2, 3]. Techniques for engineering the metabolome
to produce biofuel, chemical, and pharmaceutical products are in
continuous development and are discussed extensively elsewhere
[1, 2, 4]. However, pathway design, construction, and optimization
efforts for engineering the metabolome are contingent on a thorough
understanding of metabolic pathways, the genes and substrates
involved, and the networks and circuits that regulate the performance
and flux of the metabolic network [1, 3]. This review focuses on the
computational tools and repositories that support researchers in
“investigating the metabolome” toward successful engineering efforts.


Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology,
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_13,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


2 A Modular View of the Metabolome 


The starting model and visualization of a metabolic pathway differ
from those of other high-throughput omics outputs, such as the
results generated by sequencing efforts (e.g., RNA-Seq, WGS, 
WES). Metabolic pathways can be visualized using a network or 
map-based approach [1]. The main elements (nodes) of the net- 
work include the metabolites, which are substrates or products of 
metabolism that are responsible for driving cellular functions 
[5, 6]. Other nodes include genes (or their protein products) that 
directly interact with the metabolites or which are indirectly 
affected by the metabolites. This interplay between metabolites 
and genes will also include an implicit directionality (edges), 
which signifies the upstream and downstream ends of the pathway, 
and the production of the metabolites and substrates at intermedi- 
ate steps of the process (Fig. 1). Understanding this directionality is 
relevant to metabolic engineering approaches that rely on native 
cellular pathways for synthesis of desired compounds, or which rely 
on nonnatural pathways that introduce new reactions and new 
chemistry as part of genome-scale metabolic models [3, 7]. 
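
As a minimal sketch of this model, assuming hypothetical node names (M1, G1, and so on) rather than a real pathway, the network can be held in R as a directed edge list, and the directionality can then be followed to list everything downstream of a chosen node. The helper name `downstream` is illustrative only and does not come from any package.

```r
# Hypothetical toy network: metabolites (M*) and genes (G*) as nodes,
# with directed edges pointing from upstream to downstream.
edges <- data.frame(
  from = c("M1", "G1", "M2", "G2", "M2"),
  to   = c("G1", "M2", "G2", "M3", "M4"),
  stringsAsFactors = FALSE
)

# Breadth-first traversal that collects every node reachable from `start`.
downstream <- function(edges, start) {
  visited <- character(0)
  queue <- start
  while (length(queue) > 0) {
    node <- queue[1]
    queue <- queue[-1]
    nxt <- edges$to[edges$from == node]
    # Skip nodes that were already seen, so cycles cannot loop forever.
    nxt <- setdiff(nxt, c(visited, queue, start))
    visited <- c(visited, nxt)
    queue <- c(queue, nxt)
  }
  visited
}

downstream(edges, "G1")
```

The same edge-list layout extends naturally to node attributes (for example, a column marking each node as a metabolite or a gene) when a pathway is later imported from a database.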

Given this model of a metabolic pathway, investigating the 
metabolome can take the form of different modes of inquiry. One 
line of inquiry involves a thorough investigation of a given pathway 
and its subcomponents (metabolites and genes) and understanding 
the downstream effect of that pathway on possible genotypic or
phenotypic changes in the cell, the organism, and the metabolic
system. Another line of inquiry entails investigating the links
between metabolic data and other data outputs, such as high-
throughput sequencing, pharmacogenomics, proteomics, or other
omics datasets. The goal is to relate current knowledge that is
garnered for the metabolic pathway to other genomic or “omics” 
level data, which will provide a comprehensive system-wide view of 
the metabolic network. In addition to a more detailed view of one 
particular pathway, this can also enable comparison of multiple 
pathways in tandem. 


[Figure 1: schematic of a metabolic network. Legend: Metabolite (M1...Mn), Gene (G1...Gn), Directionality]


Fig. 1 The starting model of a metabolic network. Various metabolites (M) either are released as by-products 
of a pathway (P) or contribute and interact with genes (G) toward the production of more metabolites 
downstream. Genes (G) or their protein products are connected by a directionality that indicates the upstream 
and downstream flow of the pathway 


3 Metabolic Pathway Databases and Tools 


The advent of sequencing technologies over the past two decades
produced a large number of metabolic datasets, as well as
chemogenomics datasets that contain information about drug
compounds and substrates. We highlight several key examples that
are frequently used by researchers in the field.


3.1 Pathway Databases


There is a growing list of databases that provide genome-scale
network information to analyze, visualize, and manage metabolic
pathways. Popular and highly referenced repositories include
KEGG [8] and MetaCyc [9] (Table 1), owing to their extensive
collection of curated metabolic networks that users can query and
visualize. Some of these databases have appealing features from an
informatics perspective:


• Large-scale datasets can be accessed and downloaded
programmatically (e.g., using R and Bioconductor). This avoids
reliance solely on web-based interfaces, and data can be parsed
and efficiently integrated as part of larger and more complex
computational pipelines.


• The pathway datasets include pathways that span multiple
organisms or include biochemical pathways that are independent
of any particular organism. This facilitates meta-analytical
comparisons of pathway behavior where needed.
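
To illustrate how downloaded pathway data can be parsed for use in a pipeline, the sketch below reads geneset records in the tab-delimited GMT layout (set name, description, then member genes) using base R only. The two records are hypothetical placeholders rather than real database content, and `parse_gmt` is an illustrative helper name, not a function from any package.

```r
# Parse GMT-formatted lines into a named list of gene vectors.
# Each GMT line is: set name <TAB> description <TAB> gene1 <TAB> gene2 ...
parse_gmt <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  sets <- lapply(fields, function(x) x[-(1:2)])   # drop name and description
  names(sets) <- vapply(fields, function(x) x[1], character(1))
  sets
}

# Hypothetical records standing in for the lines of a downloaded GMT file.
gmt_lines <- c(
  "PATHWAY_A\thttp://example.org/a\tGENE1\tGENE2\tGENE3",
  "PATHWAY_B\thttp://example.org/b\tGENE2\tGENE4"
)

genesets <- parse_gmt(gmt_lines)
lengths(genesets)   # number of genes per set
```

For a file on disk, the same helper could be applied to the output of `readLines()`, which returns one character string per line.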


KEGG and MetaCyc describe interactions between enzymes
and substrates using reaction maps. KEGG [8] is a reference
knowledge base that integrates several databases: PATHWAY database,
GENES/SSDB/KO database, and COMPOUND/REACTION
database. These databases are represented as graph objects that
contain information about pathways and complexes, genes and
proteins, and biochemical compounds and reactions, respectively.
The KEGG PATHWAY database in particular contains a collection
of manually drawn networks that represent metabolic pathways; the
metabolic networks are viewed as networks of enzymes and
provided as reference pathways that are not necessarily unique to
any particular organism [1, 8]. This modular format provides
flexibility and allows parsing of the metabolic networks using a
variety of computational tools. MetaCyc [9] is utilized for pathway
prediction as part of the BioCyc dataset [10]. In contrast to KEGG,
MetaCyc contains organism-specific metabolic network diagrams,
with taxonomic information stored as part of their pathway
annotation [11]. MetaCyc can be accessed from pathway and genome
databases (PGDBs) such as BioCyc [10]. Enzyme and reaction
information in MetaCyc can be accessed using the Pathway Tools
software, a cross-platform program that powers both MetaCyc and
BioCyc.


Table 1
Overview of genome-scale databases for metabolic networks

Database: Computational access and highlights

KEGG PATHWAY
• Website (https://www.genome.jp/kegg/pathway.html)
• Pathways can be downloaded as GMT files as part of MSigDB
• Data is stored in the form of graph objects that represent proteins, chemical compounds, and genes
• Parsed using R and Bioconductor packages

BioCarta
• Pathways can be downloaded as GMT files as part of MSigDB

Reactome
• Website access
• Parsed using R and Bioconductor packages
• Pathways can be downloaded as GMT files as part of MSigDB
• The developer’s portal (https://reactome.org/dev) provides pathway widgets that users can incorporate into their web applications and an analysis service (including API) where users can analyze their own data against the Reactome database or access the Reactome content as interconnected graphs

MetaCyc
• Website access
• R and Bioconductor packages
• GMT files as part of MSigDB

MSigDB
• GMT files can be downloaded for canonical pathways (CP), containing genesets from KEGG, BioCarta, PID, Reactome, and the WikiPathways databases


3.1.1 Informatics Access of Metabolic Networks: An In-Depth Example


Rendering of the metabolic networks as reaction maps and graph
objects provides ample flexibility and versatility in terms of how
these pathways are accessed, visualized, and integrated into
bioinformatics and engineering analysis pipelines. As an example, we
focus on some of the computational access options provided for
KEGG (Table 1).


[Figure 2, left panel: KEGG pathway identifier scheme. Prefix: map = manual reference pathway; org = organism-specific pathway. Identifier start: 011 = global map; 012 = overview map; 010 = chemical structure map; 07 = drug structure map. Integration with the MODULE and NETWORK databases in KEGG: M = module; R = reaction module; N = network.]
[Figure 2, right panel: "1. Metabolism, 1.0 Global and overview maps": 01100 Metabolic pathways; 01110 Biosynthesis of secondary metabolites; 01120 Microbial metabolism in diverse environments; 01200 Carbon metabolism; 01210 2-Oxocarboxylic acid metabolism; 01212 Fatty acid metabolism; 01230 Biosynthesis of amino acids; 01250 Biosynthesis of nucleotide sugars; 01240 Biosynthesis of cofactors; 01220 Degradation of aromatic compounds.]


Fig. 2 Rendering of pathway maps using KEGG. Pathway maps are labeled with a five-digit number and a two- 
to four-letter prefix code (left panel). This rendering facilitates easy access to pathways that are global or 
organism-specific and the extraction of global, chemical, and drug structure maps. A representative snapshot 
of identifiers for global metabolic pathways is provided (right panel) 
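
The identifier scheme in Fig. 2 can be expressed as a short base R sketch that splits an identifier into its letter prefix and five-digit number and looks up the map type from the leading digits. The helper name `parse_kegg_id` and the fallback label "other" are assumptions made for illustration; they are not part of KEGG or of any R package.

```r
# Split a KEGG pathway identifier into its letter prefix and five-digit number.
parse_kegg_id <- function(id) {
  id <- sub("^path:", "", id)   # drop the optional "path:" namespace
  m <- regmatches(id, regexec("^([a-z]{2,4})([0-9]{5})$", id))[[1]]
  if (length(m) == 0) stop("not a pathway identifier: ", id)
  prefix <- m[2]
  number <- m[3]
  # Leading digits that mark special map types, as listed in Fig. 2.
  starts <- c("011" = "global map", "012" = "overview map",
              "010" = "chemical structure map", "07" = "drug structure map")
  hit <- starts[startsWith(number, names(starts))]
  map_type <- if (length(hit) > 0) unname(hit[1]) else "other"
  list(prefix = prefix, number = number,
       reference = (prefix == "map"),   # "map" marks a manual reference pathway
       map_type = map_type)
}

parse_kegg_id("path:map00900")
parse_kegg_id("hsa01100")$map_type
```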


Computerized access to the KEGG resources is possible 
through the KEGG API (application programming interface) and 
is provided for academic use. 

Using the API, we show a quick example of how to download a 
pathway map for the metabolic pathway related to “terpenoid 
backbone biosynthesis.” Our lab has recently focused on this meta- 
bolic pathway as a starting point to identify synergistic drug com- 
binations for targeting breast cancer using integrative 
pharmacogenomics datasets [12]. These simple steps provide a 
quick starting point to access the desired pathway: 


1. The list of all metabolic pathways under KEGG PATHWAY
can be found at https://www.genome.jp/kegg/pathway.html#metabolism.

Pathway maps are labeled using a five-digit number and a
prefix code (two to four letters) (Fig. 2).


2. Using the REST API, one can search for all pathways pertaining
to “terpenoid” using the “find” option of the URL:
http://rest.kegg.jp/find/<database>/<query>

A full description of the <database> and <query> parameters
is provided under the KEGG API descriptions online:
https://www.kegg.jp/kegg/rest/keggapi.html.

In our example, the <database> refers to a “pathway,” and
the <query> refers to any instance containing “terpenoid.” As
such, the search would be conducted as:
http://rest.kegg.jp/find/pathway/terpenoid


3. Our search indicates that terpenoid backbone biosynthesis is
listed as path:map00900.

Accordingly, to select this map, use the “map” prefix
along with the five-digit number of the designated pathway,
“00900,” to access the KEGG entry for this pathway directly:
https://www.kegg.jp/entry/map00900
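
The steps above can be sketched in base R as follows. The helper names `kegg_url` and `parse_find` are illustrative and not part of any package, and the response text is hard-coded so that the example runs offline; it assumes that a "find" request returns tab-delimited pairs of entry identifier and description, one per line.

```r
# Build a KEGG REST URL of the form <base>/<operation>/<database>/<query>.
kegg_url <- function(operation, database, query,
                     base = "http://rest.kegg.jp") {
  paste(base, operation, database,
        utils::URLencode(query, reserved = TRUE), sep = "/")
}

# Parse tab-delimited "find" output into a two-column data frame.
parse_find <- function(txt) {
  read.delim(text = txt, header = FALSE, sep = "\t", quote = "",
             col.names = c("entry", "description"),
             stringsAsFactors = FALSE)
}

kegg_url("find", "pathway", "terpenoid")

# Illustrative response text (assumed layout, not fetched from KEGG).
response <- "path:map00900\tTerpenoid backbone biosynthesis\n"
hits <- parse_find(response)

# Strip the "path:" namespace to obtain the map identifier used in step 3.
sub("^path:", "", hits$entry)
```

With network access, the hard-coded `response` could in principle be replaced by the text read from the generated URL, for example with `readLines()`.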


# Install and load the KEGGREST package
BiocManager::install("KEGGREST")
library(KEGGREST)

# Search pathway entries for "terpenoid", then retrieve the designated entry
keggFind("pathway", "terpenoid")
QueryPathway <- keggGet("path:map00900")


Fig. 3 Sample R commands for installing the KEGGREST package and accessing the “terpenoid backbone
biosynthesis” pathway in R


Notably, KEGG has been readily incorporated in the Bioconductor
project <https://www.bioconductor.org>, which contains
a compendium of software, data, and annotation packages. The
KEGGREST package provides quick access to the KEGG REST
API using R and Bioconductor and includes utilities to search
identifiers and link with other databases. Following the example
above, the same entry can be accessed using several R commands
which install the KEGGREST package into R, query pathway
information in KEGG, and extract the object containing the
designated pathway (Fig. 3).

There are several Bioconductor or CRAN packages that also
allow users to access KEGG in a variety of formats. These include
packages for parsing KEGG and other compound databases, for
pathway visualization or network-based analyses of metabolites,
and for pathway enrichment analyses [13]. KEGG pathway maps are
encoded using the KEGG Markup Language (KGML). KGML
contains specifications of the graph objects in KEGG, which allows
users to manipulate or reconstruct the KEGG pathway [8]. The
KEGGgraph [14] and Pathview [15] packages enable the parsing
and loading of KGML-encoded data for every pathway, which can
be supplied as a ‘KEGG Pathway’ object for manipulating the graph
object within R [16].

Using KEGGgraph, users can parse pathways that are rendered
under KEGG PATHWAY, including protein and chemical networks
(see Fig. 4 for a representative image of a network). A protein
network under KEGG PATHWAY consists of gene products
connected by “relations” (edges). For a chemical network,
connectivity between chemical compounds is illustrated by
“reactions” [14]. Metabolic networks can be viewed as both protein
and chemical networks, which encapsulate the network of
proteins (enzymes) and chemical compounds involved [14].


3.1.2 Analysis of Pathway Enrichment


In addition to informatics access of metabolic networks as reaction
maps, the classical approach to pathway analysis includes assessment
of pathway enrichment. This is a statistical calculation that informs
whether a given set of genes in a sample is enriched for a known
pathway, in this case, a metabolic pathway [16-18]. Resultant
enrichment scores are used to indicate overall whether a metabolic


NITROGEN METABOLISM 


s ry etc ae ae 61 
(aes Catton fizaton pathways ) 
\ | 


Nita Nome 
Sofa? Eee 5 ra 
(extrace Dulaz) 11.722 | 


(2m Soden }---> sia _——, 
————©( Mebane metbolsm } 
: Mcetcea ata tennis’ 
Nite oxade ( ; 
ic Ox: ————/ Glory metbolsm } 
Formate LL 


COa HCOs 


=. {EET} 0 cypmae 
Anson (63416) P02 - — - — of _ Arginine biceynthesis ) 
17110857994) 


m oa, 
ta 


o {at} 


N 
(extrace Dular’ 


0 [ms ( Ghummate | 
ipa Co at oe aon >| menbolem | 


° 
Nutoalkane Nimk 


Fig. 4 Representative example of a metabolic KEGG pathway for nitrogen metabolism. The pathway can be 
directly accessed from <https:/www.genome.jp/kegg-bin/show_pathway?ko000910>. Part of the pathway is 
shown, with genes represented in purple boxes and metabolites as small circles 


3.2 Drug Compound 
Databases 


pathway is significantly regulated in the system being studied 
[19, 20]. Many of the metabolic pathway repositories aforemen- 
tioned (including KEGG, Reactome, MetaCyc) have individual 
pathways stored as genesets (Table 1) within MSigDB 
[20, 21]. GMT files containing these genesets can be easily ported 
as part of several computational pipelines for overrepresentation 
analyses [18] by single-sample GSEA [19] or GSEA [20]. Other 
advanced algorithms and tools also include taking this further by 
including topology-based pathway enrichment, such as SPIA [22]. 


There are a number of repositories that host information about 
drug compounds, chemical substrates, bioactive molecules, as well 
as behavior of these molecules (such as the mechanism of action) 
where available. These repositories can be mined to learn about 
metabolites and by-products that are involved within metabolic 
pathways of interest, and therefore present a source of complemen- 
tary and necessary information alongside metabolic pathway data- 
bases. Accordingly, these datasets also play a significant role for 
“chemogenomics” and chemoinformatics studies that are implicit 
to metabolic engineering. Chemogenomics is an inclusive term that 
involves the screening of all possible chemical compounds against 
the universe of potential targets (proteins and drug targets) 
[23]. We highlight two of the largest of these databases below.

ChEMBL: ChEMBL is a large, manually curated drug discovery
database that hosts information about bioactive molecules
[24, 25]. The database is hosted by the European Bioinformatics
Institute (EBI), which is part of the European Molecular Biology
Laboratory (EMBL). Articles across several medicinal chemistry
journals are mined to extract new information about bioactivity
data for small molecules or peptides, which is stored as part of the
database [24]. This includes curated linkage between 2D chemical
structures and designated targets, alongside other information
about the drug properties (logP, molecular weight, Lipinski
parameters) [26]. Additionally, bioactivity data and screening results
from other databases (PubChem, BioAssay) are also incorporated
[24]. Collectively, this facilitates a number of varied investigations
and applications, which include analyzing selectivity and off-target
effects of drugs, identifying suitable drugs for a designated target,
and investigating bioactivity information that was collated from
existing experiments [24, 26, 27]. ChEMBL can be accessed at
<https://www.ebi.ac.uk/chembl/>. The latest release (ChEMBL
29) spans over two million compounds with associated
information collated from over one million assays.
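
As a brief illustration of how the Lipinski parameters mentioned above can be used once compound properties have been retrieved, the following base R sketch counts rule-of-five violations (molecular weight above 500 Da, logP above 5, more than 5 hydrogen-bond donors, or more than 10 hydrogen-bond acceptors). The column names and property values are hypothetical and are not taken from ChEMBL, whose own field names may differ.

```r
# Count rule-of-five violations from precomputed compound properties.
ro5_violations <- function(mw, logp, hbd, hba) {
  (mw > 500) + (logp > 5) + (hbd > 5) + (hba > 10)
}

# Hypothetical property table; values are invented for illustration.
compounds <- data.frame(
  id   = c("cmpd_1", "cmpd_2"),
  mw   = c(350.4, 720.9),   # molecular weight (Da)
  logp = c(2.1, 6.3),       # octanol-water partition coefficient
  hbd  = c(2, 4),           # hydrogen-bond donors
  hba  = c(5, 12),          # hydrogen-bond acceptors
  stringsAsFactors = FALSE
)

compounds$violations <- with(compounds, ro5_violations(mw, logp, hbd, hba))
compounds[, c("id", "violations")]
```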

PubChem: PubChem was developed in 2004 as a public reposi- 
tory hosted by the National Center for Biotechnology Information 
(NCBI), part of the National Institutes of Health (NIH) [26, 28, 
29]. The repository contains three component databases: Sub- 
stance, Compound, and BioAssay. The Substance database contains 
depositor-provided chemical data, such as data provided by aca- 
demic laboratories, pharmaceutical companies, or governmental 
research institutes [26, 28]. PubChem Compound stores internally 
reviewed chemical information that is extracted from the Substance 
database. The BioAssay database contains bioactivity screening 
studies of small molecules [26, 29]. PubChem is one of the most 
visited chemistry websites in the world, owing to the sheer volume 
of data collated from varied data sources and its growing compen- 
dium, including the very recent addition of chemical information 
from 100 new data sources [29].


4 Linking Metabolomic Data with High-Throughput Omics Profiles 


A growing range of sequencing technologies now facilitates the
collection of large, high-throughput datasets that can be used not
only to mine disease data [30, 31] but also to integrate several of
these data types with metabolite data. One prominent example of
merging metabolomic profiling with other “omics” datasets is
provided by the L1000 and CMAP datasets [32]. These datasets
contain genotypic information pertaining to drug-treated cancer
cell lines, allowing users to quantify gene expression changes
that occur due to treatment by drugs and experimental compounds
[31]. As part of this growing effort, the KEGGlincs package
<https://www.bioconductor.org/packages/devel/bioc/html/KEGGlincs.html>,
for example, allows users to load KEGG
PATHWAY files in R, alongside key information pertaining to the
behavior of genes within the pathway, based on knockdown
experiments. This provides complementary genotypic information that
can be merged with metabolite datasets, to provide a more
comprehensive picture of metabolic behavior.
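
The kind of merge described here can be sketched in base R without any specialized package. The gene names, expression values, and the pairing of these genes with the identifier map00900 are hypothetical and serve only to show the join; they do not reproduce KEGGlincs output.

```r
# Hypothetical pathway membership (e.g., genes parsed from one pathway map).
pathway_genes <- data.frame(
  gene = c("GENE1", "GENE2", "GENE3"),
  pathway = "map00900",
  stringsAsFactors = FALSE
)

# Hypothetical mean expression in control and drug-treated samples.
expression <- data.frame(
  gene = c("GENE1", "GENE3", "GENE9"),
  control = c(100, 40, 10),
  treated = c(50, 160, 10),
  stringsAsFactors = FALSE
)
expression$log2fc <- log2(expression$treated / expression$control)

# Left join: keep every pathway gene; genes without expression data get NA.
merged <- merge(pathway_genes, expression[, c("gene", "log2fc")],
                by = "gene", all.x = TRUE)
merged
```

Keeping unmatched pathway genes as NA, rather than dropping them, makes gaps in the expression data visible when several datasets are combined.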


5 Conclusions and Future Directions


We review multiple bioinformatics resources that are geared toward
integrative visualization and analysis of metabolic networks, as well
as chemogenomics and chemoinformatics databases. These
resources encompass both databases and computational tools that
offer a range of approaches for querying and visualizing metabolic
networks, analyzing enrichment of genesets and metabolite sets,
and investigating genotypic and phenotypic interactions between
metabolites and proteins within pathways. There is a growing
number of machine learning approaches that aim to provide a more
comprehensive understanding of the behavior of the systems and
the right conditions needed to sustain specific metabolic behaviors,
which are extensively described elsewhere [2, 33]. A large number
of analyses, particularly on drug information and pharmacogenomics
profiles, have also been conducted on numerous cell lines
[31, 34], which will aid in the efforts of profiling drugs and
understanding drug behavior in the context of a metabolic system.
However, more efforts are needed to assess metabolic profiling and
engineering in larger preclinical models that can provide an in vivo
and ex vivo outlook of metabolic behaviors, including, for example,
fields such as cancer disease modeling [35, 36]. The resources and
tools highlighted in this review will provide starting points for
investigating the metabolome and ultimately play a significant
role as part of metabolic engineering efforts.


References


1. Copeland WB, Bartley BA, Chandran D et al (2012) Computational tools for metabolic engineering. Metab Eng 14:270-280
2. Garcia-Granados R, Lerma-Escalera JA, Morones-Ramirez JR (2019) Metabolic engineering and synthetic biology: synergies, future, and challenges. Front Bioeng Biotechnol 7:36
3. Stephanopoulos G (2012) Synthetic biology and metabolic engineering. ACS Synth Biol 1:514-525
4. Ko YS, Kim JW, Lee JA et al (2020) Tools and strategies of systems metabolic engineering for the development of microbial cell factories for chemical production. Chem Soc Rev 49:4615-4636
5. Clish CB (2015) Metabolomics: an emerging but powerful tool for precision medicine. Mol Case Stud 1:a000588
6. Johnson CH, Ivanisevic J, Siuzdak G (2016) Metabolomics: beyond biomarkers and towards mechanisms. Nat Rev Mol Cell Biol 17:451-459
7. Henry CS, Dejongh M, Best AA et al (2010) High-throughput generation, optimization and analysis of genome-scale metabolic models. Nat Biotechnol 28:977-982
8. Kanehisa M, Goto S, Kawashima S et al (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res 32:D277-D280
9. Caspi R, Billington R, Fulcher CA et al (2018) The MetaCyc database of metabolic pathways and enzymes. Nucleic Acids Res 46:D633-D639
10. Karp PD, Ouzounis CA, Moore-Kochlacs C et al (2005) Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res 33:6083-6089
11. Altman T, Travers M, Kothari A et al (2013) A systematic comparison of the MetaCyc and KEGG pathway databases. BMC Bioinf 14:112
12. Van Leeuwen J, Ba-Alawi W, Branchard E et al (2020) Computational pharmacogenomics screen identifies synergistic statin-compound combinations as anti-breast cancer therapies. bioRxiv. https://doi.org/10.1101/2020.09.07.286922
13. Stanstrup J, Broeckling CD, Helmus R et al (2019) The metaRbolomics Toolbox in Bioconductor and beyond. Metabolites 9(10):200
14. Zhang JD, Wiemann S (2009) KEGGgraph: a graph approach to KEGG PATHWAY in R and bioconductor. Bioinformatics 25:1470-1471
15. Luo W, Brouwer C (2013) Pathview: an R/Bioconductor package for pathway-based data integration and visualization. Bioinformatics 29:1830-1831
16. Kramer F, Bayerlova M, Beißbarth T (2014) R-based software for the integration of pathway data into bioinformatic algorithms. Biology 3:85-100
17. Karp PD, Caspi R (2011) A survey of metabolic databases emphasizing the MetaCyc family. Arch Toxicol 85:1015-1033
18. Mubeen S, Hoyt CT, Gemünd A et al (2019) The impact of pathway database choice on statistical enrichment analysis and predictive modeling. Front Genet 10:1203
19. Barbie DA, Tamayo P, Boehm JS et al (2009) Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature 462:108-112
20. Subramanian A, Tamayo P, Mootha VK et al (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102:15545-15550
21. Liberzon A, Subramanian A, Pinchback R et al (2011) Molecular signatures database (MSigDB) 3.0. Bioinformatics 27:1739-1740
22. Tarca AL, Draghici S, Khatri P et al (2009) A novel signaling pathway impact analysis. Bioinformatics 25:75-82
23. Kubinyi H (2007) 3.40 - Chemogenomics. In: Taylor JB, Triggle DJ (eds) Comprehensive medicinal chemistry II. Elsevier, Oxford, pp 921-937
24. Bento AP, Gaulton A, Hersey A et al (2014) The ChEMBL bioactivity database: an update. Nucleic Acids Res 42:D1083-D1090
25. Glicksberg BS, Li L, Chen R et al (2019) Leveraging big data to transform drug discovery. In: Bioinformatics and drug discovery. Humana Press, New York, pp 91-118
26. Al Mahmud R, Najnin RA, Polash AH (2018) A survey of web-based chemogenomic data resources. In: Computational chemogenomics. Humana Press, New York, pp 3-62
27. Mendez D, Gaulton A, Bento AP et al (2019) ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res 47:D930-D940
28. Kim S, Chen J, Cheng T et al (2021) PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res 49:D1388-D1395
29. Sayers EW, Beck J, Bolton EE et al (2021) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 49:D10-D17
30. Gendoo DMA, Zon M, Sandhu V et al (2019) MetaGxData: clinically annotated breast, ovarian and pancreatic cancer datasets and their use in generating a multi-cancer gene signature. Sci Rep 9:8770
31. Kannan L, Ramos M, Re A et al (2016) Public data and open source tools for multi-assay genomic investigation of disease. Brief Bioinform 17:603-615
32. Subramanian A, Narayan R, Corsello SM et al (2017) A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell 171:1437-1452.e17
33. Lawson CE, Marti JM, Radivojevic T et al (2021) Machine learning for metabolic engineering: a review. Metab Eng 63:34-60
34. Smirnov P, Safikhani Z, El-Hachem N et al (2016) PharmacoGx: an R package for analysis of large pharmacogenomic datasets. Bioinformatics (Oxford, England) 32:1244-1246
35. Gendoo DMA (2020) Bioinformatics and computational approaches for analyzing patient-derived disease models in cancer research. Comput Struct Biotechnol J 18:375-380
36. Stine ZE, Schug ZT, Salvino JM et al (2021) Targeting cancer metabolism in the era of precision oncology. Nat Rev Drug Discov 2021:1-22

Chapter 14


Computational Simulation of Tumor-Induced Angiogenesis 


Masahiro Sugimoto 


Abstract 


Cancer cells require more oxygen and nutrients than normal cells. Cancer cells induce angiogenesis
(the development of new blood vessels) from preexisting vessels. This biological process depends on the
spatial, chemical, and physical properties of the microenvironment surrounding tumor tissues. The
complexity of these properties hinders an understanding of their mechanisms. Various mathematical models
have been developed to describe quantitative relationships related to angiogenesis. We developed a
three-dimensional mathematical model that incorporates angiogenesis and tumor growth. We examined
angiopoietin, which regulates the sprouting and branching events in angiogenesis. The simulation successfully
reproduced the transient decrease in new vessels during vascular network formation. This chapter describes
the protocol used to perform the simulations.


Key words Tumor, Cancer, Angiogenesis, Systems biology, Simulation 


1 Introduction


The microenvironment surrounding tumors is one of the key
factors that determine the course of tumor growth. Tumors are
exposed to low oxygen levels (hypoxia), leading to high metabolic
stress [1]. To stably obtain oxygen and nutrition, tumor angiogenesis
factors (TAFs) [2], such as vascular endothelial growth factor
(VEGF) [3], are secreted from the tumor to surrounding regions to
stimulate existing blood vessels to induce the development of new
blood vessels (angiogenesis) [4]. Spatial conditions, such as the
location of preexisting blood vessels, the distance between these
vessels and tumors, the extracellular matrix, and the distribution of
secreted TAFs, shape the vascular networks [5]. Chemical reactions
and changes in the physical properties caused by the tumor
and distorted space also contribute to the network formation.
Bevacizumab (Avastin) is used to prevent angiogenesis and thereby
inhibit the supply of molecules needed for tumor growth [6-8].
However, predicting treatment efficacy is still difficult because
angiogenesis depends on these multidisciplinary features. Therefore,
understanding this biological process is important but difficult.

Various mathematical simulation models have been developed
to understand tumor-induced angiogenesis [9, 10]. The available
models are classified into three types: angiogenesis [11-15], tumor
growth [16-19], and integration of both [20-25]. Conventional
models are implemented in two-dimensional spaces because of the
high computational cost of three-dimensional simulation. Recently,
models implemented in three-dimensional space have become
available [26-28]. In addition, recent models incorporate a wider
range of chemical and physical factors than conventional models,
which considered only a few. Both of these improvements
contribute to more reproducible and realistic simulation of
angiogenesis processes.

Tang et al. developed a three-dimensional model to
reproduce angiogenesis and tumor growth considering the chemical,
physical, and spatial properties of the tumor microenvironment
[29]. This model implemented the sprouting of a new blood vessel
depending on the TAF concentration. However, the sprouting
mechanism is more complex and depends on the angiopoietin
family [30]. Angiopoietin is expressed in vascular endothelial cells
and regulates the adhesion between vascular wall cells and
endothelial cells. Angiopoietin-1 (Ang-1) promotes endothelial-parietal
cell adhesion and vascular maturation by binding to the receptor
tyrosine kinase Tie-2. Ang-2 is an antagonist of the Tie-2 receptor,
which weakens cell-cell adhesion. Therefore, the balance between
Ang-1 and Ang-2 controls the stabilization and remodeling of
blood vessels and capillary sprouting [31].

We modified the model of Tang et al. to implement Ang-1 and
Ang-2 as the regulatory functions of vascular flexibility
[32]. During the angiogenesis process, a transient regression of
blood vessel growth has been observed in vivo [33]. Our model
successfully reproduced this phenomenon.

Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology,
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_14,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


2 Methods


2.1 Protocols

The physical and chemical processes involved in tumor growth and
angiogenesis are described here. The overall concept of a tumor,
including preexisting blood vessels, new blood vessels, and
distributed molecules, is depicted in Fig. 1. The distribution of all
molecules is described by partial differential equations. At each
time step, the pressure gradient and the distribution of each factor,
such as oxygen, are calculated, and the status of new blood vessel
formation and tumor growth is updated. An overview of each
step is provided below.


[Figure 1 labels: preexisting vessel; O2, Ang-1, and Ang-2 secretion; sprouting at TEC; branching; new blood vessel; VEGF gradient; secretion of VEGF and CO2; tumor]

Fig. 1 The overall concept of tumor-induced angiogenesis. The top red area is the preexisting vessel and the
center circle is a tumor. The preexisting vessel and tumor distribute various elements, such as VEGF and CO2.
Angiopoietins (Ang-1 and Ang-2) contribute to the sprouting and branching of new blood vessels


[Figure 2 labels: each axis (x, y, z) is discretized into 200 grid points; initial cancer cells (n = 5) at x = 100, y = 100, z = 100; the initial preexisting blood vessel is defined]

Fig. 2 Initialization of the simulation space


2.1.1 Initialization

1. Initialize the simulation space and discretize it (e.g., a
200 x 200 x 200 grid).


2. Place five cancer cells at the center of the computational
domain.


3. Place the preexisting blood vessels. These vessels should be
placed at a distance from the cancer cells at the center (Fig. 2).


4. Calculation: Each simulated day comprises 33 simulation
steps. At each step, the pressure gradient and the factor
distributions are calculated, and the new blood vessels and
tumor cells are subsequently updated. The dependencies of
the variables are shown in Fig. 3.
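
The initialization in steps 1-3 can be sketched as follows. This is only a minimal illustration in Python, not the MATLAB code used for the chapter: the function names, the random placement rule for the five cells, and the position of the straight vessel (z = 180) are assumptions made for the example.

```python
import random

GRID = 200           # grid points per axis (step 1)
STEPS_PER_DAY = 33   # simulation steps per simulated day (step 4)
N_DAYS = 60          # simulated duration used in the protocol
CENTER = (GRID // 2, GRID // 2, GRID // 2)


def place_tumor_cells(n_cells=5, center=CENTER, spread=1, seed=0):
    """Return n_cells distinct grid points within `spread` of the center.

    The random placement rule is an assumption of this sketch.
    """
    rng = random.Random(seed)
    cells = set()
    while len(cells) < n_cells:
        cells.add(tuple(c + rng.randint(-spread, spread) for c in center))
    return sorted(cells)


def place_straight_vessel(z=180, y=100, length=GRID):
    """Return the grid points of a hypothetical straight vessel along the x-axis."""
    return [(x, y, z) for x in range(length)]


def min_distance(points, target):
    """Smallest Euclidean distance from any point in `points` to `target`."""
    return min(
        sum((p - t) ** 2 for p, t in zip(point, target)) ** 0.5
        for point in points
    )


tumor_cells = place_tumor_cells()
vessel = place_straight_vessel()
total_steps = STEPS_PER_DAY * N_DAYS  # number of time steps in a full run
```

Storing only the occupied grid points, rather than a dense 200 x 200 x 200 array, keeps the sketch light; a real implementation would also allocate concentration fields on the full grid.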


Fig. 3 Overall processing and parameter dependencies at each step of the simulation. The process proceeds from
the tissue-level calculations to the cell-level ones, which include pressure, factor distribution, angiogenesis, cell
phenotype, and tumor growth. For example, in the pressure calculation, (1) CTP, (2) VTP, and (3) interstitial
fluid velocity are calculated. Subsequently, distributions of various factors (e.g., O2 and CO2) are calculated.
TAF, Ang-1, and Ang-2 contribute to the vessel formation and these processes form a loop. O2 and CO2
contribute to the cell phenotype and tumor growth. CTP and VTP indicate cell-induced tumor pressure and
vascular perfusion-induced tumor pressure, respectively


5. Pressure: The tumor microenvironment pressure is calculated
by combining cell-induced tumor pressure (CTP) and vascular
perfusion-induced tumor pressure (VTP). The CTP at a spatial
point is calculated as the sum of the pressures caused by the
surrounding N tumor cells. VTP is calculated as the sum of the
pressures caused by the vascular endothelial cells present next to
the point x0, using a method similar to that used for CTP.
Pressure (P) is calculated by adding CTP and VTP at each grid
point.


6. Oxygen (O2): O2 diffusion is calculated. O2 is assumed to be a
nutrient necessary for tumor growth. The four processes
involved are O2 diffusion, convection, O2 secretion from
blood cells, and O2 consumption by tumor cells. The
spatial-temporal evolution of the O2 concentration ($n$) is calculated as:

$$\frac{\partial n}{\partial t} = D_n \nabla^2 n - \nabla \cdot (\mathbf{u}\, n) + \rho_n\left(r_v, (p_v - p)\right)\delta_{\Sigma_V} - \lambda_n(A_i)\,\delta_{\Omega_T}, \tag{1}$$

where $n$ is the O2 concentration, $D_n$ is the diffusion constant of O2,
$\mathbf{u}$ is the pressure gradient, $A_i$ is the cellular activity of tumor cells,
$\lambda_n$ is the removal term, and $\delta_{\Sigma_V}$ and $\delta_{\Omega_T}$ indicate the activity of all
vascular cells and all tumor cells, respectively.


The third term is the kinetics of O2 secretion from blood cells,
calculated as:

$$\rho_n\left(r_v, (p_v - p)\right) = \rho_{n0} R_i W, \tag{2}$$

where $\rho_{n0}$ is the O2 supply rate, $R_i$ is the radius of the vessel, and
$W$ is the pressure gradient in the arterial wall.
The fourth term is O2 consumption, whose rate is calculated as:

$$\lambda_n(A_i) = \lambda_{n0} A_i, \tag{3}$$

$$A_i = \frac{n}{n+1} \exp\left(-5 (w - 1)^4\right), \tag{4}$$

where $\lambda_{n0}$ is the O2 consumption rate and $w$ is the concentration of
carbon dioxide (CO2).
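
To make the structure of the O2 balance concrete, the following sketch advances a one-dimensional reduction of Eq. 1 by one explicit time step. It is a simplified illustration under stated assumptions, not the chapter's three-dimensional MATLAB implementation: the convection term is omitted, the boundaries are zero-flux, the cell activity is passed in as a given value, and all function and parameter names (`step_oxygen`, `rho0`, `lam0`, and so on) are hypothetical.

```python
def step_oxygen(n, D, dt, dx, vessel_idx=(), tumor_activity=None,
                rho0=1.0, R=1.0, W=1.0, lam0=1.0):
    """Advance a 1D O2 profile by one explicit Euler step.

    n              -- list of O2 concentrations on a uniform grid
    vessel_idx     -- grid indices occupied by vessel cells (secretion, Eq. 2)
    tumor_activity -- dict {grid index: cell activity A_i} (consumption, Eq. 3)
    Convection is omitted and the boundaries are zero-flux (assumptions).
    """
    tumor_activity = tumor_activity or {}
    if D * dt / dx ** 2 > 0.5:
        raise ValueError("explicit scheme needs D*dt/dx^2 <= 0.5")
    m = len(n)
    new = list(n)
    # Diffusion: second-order central difference with mirrored end points
    for i in range(m):
        left = n[i - 1] if i > 0 else n[i]
        right = n[i + 1] if i < m - 1 else n[i]
        new[i] += dt * D * (left - 2.0 * n[i] + right) / dx ** 2
    # Secretion at vessel positions: rho_n0 * R_i * W
    for i in vessel_idx:
        new[i] += dt * rho0 * R * W
    # Consumption at tumor positions: lambda_n0 * A_i, never below zero
    for i, activity in tumor_activity.items():
        new[i] = max(0.0, new[i] - dt * lam0 * activity)
    return new


profile = step_oxygen([0.0, 0.0, 1.0, 0.0, 0.0], D=1.0, dt=0.1, dx=1.0)
```

With no vessels or tumor cells, the step only redistributes O2, so the total amount is conserved; this is a convenient sanity check before adding the source and sink terms.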


(a) Cell activity: The cell activity determines the activity
status of each tumor cell. Cell activity is calculated according
to the nutrient acquisition status estimated from the O2
concentration. The transition of cells between the active and
quiescent states is reversible. Based on Eq. 4, the cell status is
determined as:

$$\begin{cases} A_i > 0.5 & \text{(active)} \\ A_i < 0.5 & \text{(quiescent)} \end{cases} \tag{5}$$

(b) Cell vital energy (CVE): CVE is the energy stored in the cell
for proliferation. This concept explains tumor cell progression
toward proliferation, as well as cell life and death. The derivative
of CVE ($V$) is calculated from the cell activity, with positive
kinetics in the active state and negative kinetics in the quiescent
state, as follows:

$$\frac{\partial V}{\partial t} = \begin{cases} \dfrac{A_i}{A_i + 1}\, k_1 & \text{(active)} \\ -k_2 & \text{(quiescent)} \end{cases} \tag{6}$$

Based on the cell activity and CVE, the tumor cell status is
determined to be active, necrotic, or quiescent (Fig. 4).


[Figure 4 labels: Cell Activity > Cell Activity_th; CVE > CVE_th; calculation; quiescent]

Fig. 4 The change of tumor status is based on cell activity and cell vital energy
(CVE). Cell Activity_th and CVE_th indicate predefined thresholds of these
parameters. Necrotic, quiescent, and active are the possible statuses of a cancer cell
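
The status update described in (a) and (b) can be illustrated with a short sketch. It is a hedged reading of the text rather than the published rule set: the gain term `activity / (activity + 1)` follows one interpretation of Eq. 6, the rule that a cell becomes necrotic once its CVE is exhausted is an assumption, a cell with activity exactly at the threshold is treated as quiescent, and the names `update_cell`, `k_gain`, `k_loss`, and `cve_death` are hypothetical.

```python
def update_cell(activity, cve, dt, k_gain=1.0, k_loss=1.0,
                activity_th=0.5, cve_death=0.0):
    """Return (status, new_cve) for one tumor cell after one time step.

    activity -- cell activity A_i (taken as given here)
    cve      -- cell vital energy before the step
    """
    if activity > activity_th:
        status = "active"
        # Active cells accumulate vital energy (positive kinetics)
        cve += dt * k_gain * activity / (activity + 1.0)
    else:
        status = "quiescent"
        # Quiescent cells lose vital energy (negative kinetics)
        cve -= dt * k_loss
    if cve <= cve_death:
        # Assumption of this sketch: exhausted energy means necrosis
        status = "necrotic"
    return status, cve


status, energy = update_cell(activity=1.0, cve=1.0, dt=0.1)
```

Because the active and quiescent branches are evaluated again at every step, a quiescent cell whose O2 supply recovers returns to the active state, which matches the reversibility stated in (a).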


(c) Tumor angiogenesis factors (TAFs): The distribution of TAFs
in the simulation space is determined. Among the various types of
TAF molecules, only VEGF is considered. The concentration
of VEGF ($c$) is calculated from diffusion, convection,
secretion by tumor cells, and removal from the vessels as follows:

$$\frac{\partial c}{\partial t} = D_c \nabla^2 c - \nabla \cdot (\mathbf{u}\, c) + \rho_c(n)\,\delta_{\Omega_T} - \lambda_c(r_v)\,\delta_{\Sigma_{TEC}}, \tag{7}$$

where $D_c$ is the VEGF diffusion constant and $\delta_{\Sigma_{TEC}}$ is the
occurrence of the tip endothelial cells (TECs).

The rates of VEGF secretion and consumption are assumed to
be proportional to the O2 concentration and the vessel radius, respectively.

(d) CO2: The distribution of CO2 in the simulation space is
calculated. The CO2 kinetics include diffusion, convection,
secretion by tumor cells, and removal by blood cells, in the same
manner as for VEGF.

(e) Ang-1 and Ang-2: Angiopoietin is a capillary sprouting
angiogenesis factor. The angiopoietin family regulates the adhesion
levels between vessel wall cells and endothelial cells by binding
to the Tie-2 receptor-type tyrosine kinase. Tie-2 is expressed
in vascular endothelial cells and regulates angiogenesis. Ang-1
stimulates Tie-2 on vascular endothelial cells during
angiogenesis. Ang-2 inhibits Ang-1 binding to Tie-2. Therefore, the
balance between Ang-1 and Ang-2 governs the vascular state.
The distributions of Ang-1 and Ang-2 are calculated using
diffusion, convection, secretion, and elimination terms, as for
VEGF and O2. The secretion rate of angiopoietin is
assumed to increase with the density of endothelial cells or
TECs. The consumption rate is determined based on the
concentration of angiopoietin.


7. Sprouting and regression: The formation of blood vessels is
updated. TAF promotes the sprouting of new blood vessels. In
addition, capillary sprouting and branching are determined by
the distance of each cell from the tumor and the balance of
angiopoietin concentrations. Figure 5 shows the relationships
among the described factors.


8. Stop: The simulation is stopped after 60 days.


2.2 Implementation


All simulations described in this manuscript were performed using
MATLAB R2019b (MathWorks, Natick, MA, USA). The simulation
environment was a computer with an Intel Xeon CPU E3-1230 V2
(3.30 GHz) and 20 GB of RAM. Each simulation run takes
approximately 2 days. The source code (available upon request)
runs in MATLAB.
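
The overall control flow of the protocol (33 steps per simulated day, 60 days, with tissue-level updates preceding cell-level updates as in Fig. 3) can be outlined as below. This is only a structural sketch in Python rather than the MATLAB source, which is available from the author on request; the four hook functions are hypothetical placeholders that a real implementation would replace with the pressure, factor, vessel, and tumor updates.

```python
def run_simulation(update_pressure, update_factors, update_vessels,
                   update_tumor, n_days=60, steps_per_day=33):
    """Call the four update hooks in protocol order at every time step.

    Each hook receives (day, step); returns the number of steps executed.
    """
    for day in range(n_days):
        for step in range(steps_per_day):
            update_pressure(day, step)   # CTP, VTP, interstitial fluid velocity
            update_factors(day, step)    # O2, CO2, VEGF, Ang-1, Ang-2
            update_vessels(day, step)    # sprouting, branching, regression
            update_tumor(day, step)      # cell activity, CVE, cell status
    return n_days * steps_per_day


# Demonstration with logging hooks on a shortened run (2 days x 3 steps)
log = []


def make_hook(name):
    """Build a placeholder hook that records when it was called."""
    def hook(day, step):
        log.append((day, step, name))
    return hook


hooks = [make_hook(name) for name in ("pressure", "factors", "vessels", "tumor")]
n_steps = run_simulation(*hooks, n_days=2, steps_per_day=3)
```

Passing the updates in as functions keeps the time loop independent of the model details, so individual processes (for example, the angiopoietin terms) can be switched on or off when comparing model variants.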


[Figure 5 labels: vessel extension; Days > Days_th; tip cell deactivated; vessel regression]

Fig. 5 Flowchart of new blood vessel growth. VEGF_th and Days_th indicate the thresholds of these parameters


Fig. 6 Snapshots of the spatial distribution of preexisting and new blood vessels and tumors from day 0 (a) to
day 60 (j)


2.3 Simulated Results

Simulations were run for 60 simulated days. Figure 6 shows the
preexisting blood vessels (red curves), newly developed blood
vessels (blue curves), and tumors (brown circles). The preexisting
blood vessel was defined as the initial condition (Fig. 6a). Initially,
the tumor grew without new blood vessel development (Fig. 6b).
New blood vessels subsequently developed, and several reached the
tumor (Fig. 6c-f). The tumor size increased. The new blood vessels
were then remodeled and showed a transient regression (Fig. 6g).
Finally, new blood vessels developed again, and the tumor grew
rapidly (Fig. 6h-j).

The distribution of VEGF at the cross-section (y = 100) is
shown in Fig. 7. VEGF was not observed in the initial stage
(Fig. 7a). The VEGF concentration gradually increased, and the
distributed area also expanded (Fig. 7a-f). Subsequently, the
concentration and distributed area became more stable (Fig. 7g-j).


Fig. 7 Temporal change of the distribution of VEGF at the cross-section (y = 100) from day 0 (a) to day 60 (j)


Acknowledgment

This research was funded by a grant from JSPS KAKENHI (grant
number 20B205).

References
1. Wilson WR, Hay MP (2011) Targeting hypoxia in cancer therapy. Nat Rev Cancer 11(6):393-410
2. Shweiki D, Itin A, Soffer D et al (1992) Vascular endothelial growth factor induced by hypoxia may mediate hypoxia-initiated angiogenesis. Nature 359(6398):843-845
3. Plate KH, Breier G, Weich HA et al (1992) Vascular endothelial growth factor is a potential tumour angiogenesis factor in human gliomas in vivo. Nature 359(6398):845-848
4. Folkman J, Klagsbrun M (1987) Angiogenic factors. Science 235(4787):442-447
5. Yadav L, Puri N, Rastogi V et al (2015) Tumour angiogenesis and angiogenic inhibitors: a review. J Clin Diagn Res 9(6):XE01
6. Costache M, Ioana M, Iordache S et al (2015) VEGF expression in pancreatic cancer and other malignancies: a review of the literature. Rom J Intern Med 53(3):199-208
7. Ferrara N, Hillan KJ, Novotny W (2005) Bevacizumab (Avastin), a humanized anti-VEGF monoclonal antibody for cancer therapy. Biochem Biophys Res Commun 333(2):328-335
8. Lauro S, Onesti CE, Righini R et al (2014) The use of bevacizumab in non-small cell lung cancer: an update. Anticancer Res 34(4):1537-1545
9. Metzcar J, Wang Y, Heiland R et al (2019) A review of cell-based computational modeling in cancer biology. JCO Clin Cancer Inform 2:1-13
10. Vilanova G, Colominas I, Gomez H (2017) Computational modeling of tumor-induced angiogenesis. Arch Comput Methods Eng 24(4):1071-1102
11. Jones PF, Sleeman BD (2006) Angiogenesis - understanding the mathematical challenge. Angiogenesis 9(3):127-138
12. Milde F, Bergdorf M, Koumoutsakos P (2008) A hybrid model for three-dimensional simulations of sprouting angiogenesis. Biophys J 95(7):3146-3160
13. Perfahl H, Hughes BD, Alarcón T et al (2017) 3D hybrid modelling of vascular network formation. J Theor Biol 414:254-268
14. Travasso RD, Corvera Poiré E, Castro M et al (2011) Tumor angiogenesis and vascular patterning: a mathematical model. PLoS One 6(5):e19989
15. Zhao G, Yan W, Chen E et al (2013) Numerical simulation of the inhibitory effect of angiostatin on metastatic tumor angiogenesis and microenvironment. Bull Math Biol 75(2):274-287
16. Anderson AR, Weaver AM, Cummings PT et al (2006) Tumor morphology and phenotypic evolution driven by selective pressure from the microenvironment. Cell 127(5):905-915
17. Jiang Y, Pjesivac-Grbovic J, Cantrell C et al (2005) A multiscale model for avascular tumor growth. Biophys J 89(6):3884-3894
18. Peng L, Trucu D, Lin P et al (2017) A multiscale mathematical model of tumour invasive growth. Bull Math Biol 79(3):389-429
19. Shirinifard A, Gens JS, Zaitlen BL et al (2009) 3D multi-cell simulation of tumor growth and angiogenesis. PLoS One 4(10):e7190
20. Lyu J, Cao J, Zhang P et al (2016) Coupled hybrid continuum-discrete model of tumor angiogenesis and growth. PLoS One 11(10):e0163173
21. Mahlbacher G, Curtis LT, Lowengrub J et al (2018) Mathematical modeling of tumor-associated macrophage interactions with the cancer microenvironment. J Immunother Cancer 6(1):1-17
22. Salavati H, Soltani M, Amanpour S (2018) The pivotal role of angiogenesis in a multi-scale modeling of tumor growth exhibiting the avascular and vascular phases. Microvasc Res 119:105-116
23. Stéphanou A, Lesart A-C, Deverchére J et al (2017) How tumour-induced vascular changes alter angiogenesis: insights from a computational model. J Theor Biol 419:211-226
24. Xu J, Vilanova G, Gomez H (2016) A mathematical model coupling tumor growth and angiogenesis. PLoS One 11(2):e0149422
25. Yonucu S, Yilmaz D, Phipps C et al (2017) Quantifying the effects of antiangiogenic and chemotherapy drug combinations on drug delivery and treatment efficacy. PLoS Comput Biol 13(9):e1005724
26. Liang W, Zheng Y, Zhang J et al (2019) Multiscale modeling reveals angiogenesis-induced drug resistance in brain tumors and predicts a synergistic drug combination targeting EGFR and VEGFR pathways. BMC Bioinformatics 20(7):59-71
27. Wijeratne PA, Vavourakis V (2019) A quantitative in silico platform for simulating cytotoxic and nanoparticle drug delivery to solid tumours. Interface Focus 9(3):20180063
28. Xu J, Vilanova G, Gomez H (2017) Full-scale, three-dimensional simulation of early-stage tumor growth: the onset of malignancy. Comput Methods Appl Mech Eng 314:126-146
29. Tang L, Van De Ven AL, Guo D et al (2014) Computational modeling of 3D tumor growth and angiogenesis for chemotherapy evaluation. PLoS One 9(1):e83962
30. Maisonpierre PC, Suri C, Jones PF et al (1997) Angiopoietin-2, a natural antagonist for Tie2 that disrupts in vivo angiogenesis. Science 277(5322):55-60
31. Fagiani E, Christofori G (2013) Angiopoietins in angiogenesis. Cancer Lett 328(1):18-26
32. Yanagisawa H, Sugimoto M, Miyashita T (2021) Mathematical simulation of tumour angiogenesis: angiopoietin balance is a key factor in vessel growth and regression. Sci Rep 11(1):1-13
33. Holash J, Maisonpierre P, Compton D et al (1999) Vessel cooption, regression, and growth in tumors mediated by angiopoietins and VEGF. Science 284(5422):1994-1998




Computational Methods and Deep Learning for Elucidating 
Protein Interaction Networks 


Dhvani Sandip Vora, Yogesh Kalakoti, and Durai Sundar 


Abstract 


Protein interactions play a critical role in all biological processes, but experimental identification of protein 
interactions is a time- and resource-intensive process. The advances in next-generation sequencing and 
multi-omics technologies have greatly benefited large-scale predictions of protein interactions using 
machine learning methods. A wide range of tools have been developed to predict protein-protein, pro- 
tein-nucleic acid, and protein-drug interactions. Here, we discuss the applications, methods, and challenges 
faced when employing the various prediction methods. We also briefly describe ways to overcome the 
challenges and prospective future developments in the field of protein interaction biology. 


Key words Deep learning, Machine learning, Interaction, PPI, Protein networks, Neural networks 


1. Introduction 


The discovery of the DNA structure in 1953 prompted multiple 
studies into macromolecules and their effects on the various prop- 
erties of life. Roles of other types of biomolecules in the cell were 
established — RNA as intermediates and proteins as the effector 
molecules. However, further investigations revealed other complex 
types and functions of these macromolecules. Since the start of the 
twenty-first century, with the development of various sequencing 
programs and platforms, multiple databases to store biological data 
have been established. With a surge in biological data, bioinformat- 
ics has become an essential component of natural sciences. To 
understand the various biological processes, it is also necessary to 
identify the components, their roles, and their relationships. 
Biological systems are a complex web of interactions — gene 
transcription, metabolic signaling, and protein-protein interac- 
tions, to name a few layers that stack up to form an organism. 
Identified with the philosopher Descartes, the reductionist approach
asserts that a complex system or situation can be better analyzed by
reducing it to a sum of simpler parts. Similarly, in biology, higher
levels in a hierarchy can be understood by studying the individual
components. In the current post-genomics era, as more omics data
is generated, there is a dearth of appropriate representations to
describe biological networks. Individual members of a biological
network, like genes, proteins, or drugs, can be represented as
nodes, while the edges represent the various physical, biochemical,
or functional interactions between the nodes. Whole cells and
organisms can be represented by biological networks (collectives
of features and behaviors) and can be systematically studied to
predict associations between molecules, genes, diseases, and drugs
and their targets. However, since the nature of biological data is
complex and dynamic, the analysis requires multidisciplinary
approaches.

Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology,
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_15,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


1.1 Protein Interaction Identification Methods


Protein interactions play a vital role in the formation of structures 
and enzymatic regulation in a cell, maintaining homeostasis in the 
organism. Predicting the functions and interactions of proteins is 
among the most crucial pursuits in biology, yet most bioinformatics 
solutions are template-based algorithms. Although various bioin- 
formatics approaches are available, they are limited by the accuracy 
of the prediction model, and hence, experimental methods are still 
considered more reliable [1, 2]. 

X-ray crystallography is a preferred method for determining
full-atom coordinates of a protein complex; however, it is costly
and time-intensive [3]. Moreover, not all proteins or protein
complexes crystallize easily. Although the other experimental
techniques do not provide atomic-level information, they are more
popular because of their reliability over crystallography. Interaction
detection approaches can be classified into several types
(Table 1). Depending on the organism and the goal, various
experimental techniques are available to detect and identify protein
binding events [4-6]. Each technique has advantages but requires
specific instruments and extensive knowledge for result analysis. As
the field is developing, new and improved methods to reliably
predict interactions are emerging. Yet, with the upsurge in available
omics data in the recent past, in silico methods will need to play an
increasing part in determining protein interactions at the
atomic scale.

The predictive methods to model protein complexes can be
categorized into (i) homology (template-based) modeling and (ii) ab
initio or template-free docking. Template-based predictions depend
heavily on the presence of similar structures reported in literature or
protein databases. Since template-based methods are limited by the
number of quaternary structures available, ab initio methods are
gaining traction with the increasing number of macromolecule
sequences available. Protein domains and chains are known to be
dynamic and undergo multiple conformational changes, reducing
the accuracy of the computationally intensive docking approach.
Numerous other techniques have also been proposed for
structure-based interaction prediction, summarized in Table 2.


Table 1
Different experimental techniques for detecting protein interactions

Interaction detection method type   Name                                           Reference
Biochemical                         Affinity technology                            [175]
                                    Aggregation assay                              [176]
                                    Chromatography technology                      [175]
                                    Cosedimentation                                [175]
                                    Cross-linking study                            [175]
                                    Electrophoretic mobility-based method          [177]
                                    Enzymatic study                                [175]
                                    Probe interaction assay                        [175]
Biophysical                         Biosensor                                      [178]
                                    Circular dichroism                             [179]
                                    Mass spectrometry                              [177]
                                    Equilibrium dialysis                           [180]
                                    Filter trap assay                              [181]
                                    Fluorescence technology                        [176]
                                    Infrared spectroscopy                          [182]
                                    Intermolecular force                           [183]
                                    Isothermal titration calorimetry               [184]
                                    Light scattering                               [185]
                                    Neutron fiber diffraction                      [186]
                                    Nuclear magnetic resonance                     [187]
                                    Scintillation proximity assay                  [188]
                                    Small angle neutron scattering                 [179]
                                    Ultraviolet-visible spectroscopy               [189]
                                    X-ray crystallography                          [3]
Genetic                             Chemical RNA modification plus base            [190]
                                    Random spore analysis                          [175]
                                    Synthetic genetic analysis                     [175]
Imaging techniques                  Atomic force microscopy                        [191]
                                    Confocal microscopy                            [175]
                                    Electron microscopy                            [192]
                                    Fluorescence microscopy                        [175]
                                    Light microscopy                               [175]
                                    Super-resolution microscopy                    [175]
                                    X-ray tomography                               [175]
Phenotype-based                     Nuclear translocation assay                    [193]
Posttranscriptional                 Antisense RNA                                  [194]
                                    RNA interference                               [195]
Protein complementation             Adenylate cyclase complementation              [196]
                                    β-galactosidase complementation                [197]
                                    β-lactamase complementation                    [198]
                                    Bimolecular fluorescence complementation       [199]
                                    Mammalian protein-protein interaction trap     [200]
                                    Protein kinase A complementation               [201]
                                    Reverse ras recruitment system                 [202]
                                    Split luciferase complementation               [203]
                                    Tox-R dimerization assay                       [204]
                                    Transcriptional complementation assay          [175]

An overview of some of the commonly used experimental techniques to detect and elucidate protein interactions,
grouped into categories based on the methods followed


Table 2
Structure-based methods for modeling and predicting protein interactions

Name          Approach                                                                   Reference
ClusPro       Evaluates presumed complexes, retains a few promising complexes, and       [205]
              scores them based on electrostatic and free energies
GRAMM-X       Global search is performed by using FFT, and a Lennard-Jones potential     [206]
              is implemented on a fine grid to determine the best surface match
HexServer     Implements a closed-form spherical polar FFT correlation expression        [207]
LZerD         Predictions are generated by using 3DZD, a mathematical protein surface    [208]
              representation method
Multi-LZerD   Predictions from LZerD are combined using a genetic algorithm and          [209]
              scored using several methods
PatchDock     The surface of the two molecules is segmented into geometric patches.      [210]
              The patches containing interacting residues are filtered, and pose
              clustering techniques are applied
RosettaDock   Monte Carlo-based docking algorithm                                        [211]
ZDOCK         A 3D FFT search of degrees of freedom between two proteins is carried      [212]
              out and scored using statistical potential

Some of the software and servers available for predicting protein interactions and scoring possible conformations


While multiple methods are available for predicting protein 
interactions, quality assessment of the predictions for ranking and 
elimination of unlikely complexes and poses is crucial. Docking 
programs generally perform an energy-based scoring of the com- 
plexes to determine relevant structures. However, the scores 
assigned are relative and cannot be compared across platforms. 
Consensus clustering is another method that helps determine qual- 
ity of predicted poses, by clustering structures of similar scores 
together. The score could either be the root-mean-square deviation 
(RMSD) or the template modeling (TM) score. 
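
As a small illustration of the RMSD measure mentioned above, the sketch below computes the deviation between two poses given as lists of atomic coordinates. It is a simplified example under stated assumptions: the two poses are taken to be already superimposed and to list the same atoms in the same order, so no optimal alignment step is performed, and the function name `rmsd` and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists.

    Each list holds (x, y, z) tuples for the same atoms in the same order.
    No superposition is performed (assumption of this sketch).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for atom_a, atom_b in zip(coords_a, coords_b):
        total += sum((p - q) ** 2 for p, q in zip(atom_a, atom_b))
    return math.sqrt(total / len(coords_a))


pose_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pose_2 = [(x + 3.0, y + 4.0, z) for x, y, z in pose_1]  # rigid shift by (3, 4, 0)
deviation = rmsd(pose_1, pose_2)
```

Poses whose pairwise RMSD falls below a chosen cutoff can then be grouped into the same cluster; the cutoff itself depends on the docking program and the size of the complex.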

Another class of methods to predict protein interaction
networks, based on network topology, is not reliant on new
biological data. The topology of the known interaction networks
is utilized to predict missing links based on the triadic closure
principle [7]. Similar to social network analyses, protein pairs are
given a higher score when interaction partners are shared.
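
The shared-partner idea can be sketched in a few lines. The example below scores every pair of proteins that is not yet connected by the number of interaction partners the two have in common, which is the simplest form of triadic-closure scoring; it is an illustrative sketch, not the scoring function of any specific published tool, and the protein labels and the function name `shared_partner_scores` are hypothetical.

```python
from itertools import combinations


def shared_partner_scores(edges):
    """Score each non-interacting protein pair by its number of shared partners."""
    partners = {}
    for a, b in edges:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    scores = {}
    for a, b in combinations(sorted(partners), 2):
        if b in partners[a]:
            continue  # already a known interaction, nothing to predict
        shared = len(partners[a] & partners[b])
        if shared:
            scores[(a, b)] = shared
    return scores


# Toy network of known interactions (hypothetical protein labels)
known = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
scores = shared_partner_scores(known)
```

In this toy network, the pairs A-D and B-C each share two partners and therefore rank above B-E and C-E, which share one. Normalized variants, such as dividing by the size of the union of the two partner sets, reduce the bias toward highly connected hub proteins.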

The progress made toward the formidable challenge of laying down a
comprehensive map of an organism’s interactome, especially that of
complex eukaryotes, has been slow but steady. The entire human
interactome is estimated to comprise more than 100,000 binary protein
interactions [8], only around half of which have been identified so
far through the Human Reference Protein Interactome (HuRI) project
[9]. Interactomes of various organisms have been studied, but most
remain incomplete [10-12].

However, such experimental and in silico methods cannot completely
identify the effects of physicochemical factors or track the transient
dynamics of the complex. Moreover, the differences in binding
affinities because of loops or disordered regions, posttranslational
modifications, or the influence of physiological factors are difficult
to predict. Hence, there is a need for accurate prediction models that
can effectively identify even transient interactions, expanding the
coverage of interactions predicted while filtering out the
false-negative and false-positive hits. Moreover, the relevance and
statistical significance of the predicted interactions and interaction
networks need to be determined. Computational approaches have proved
beneficial in extrapolating from experimental data and may help
determine the complete interactome of organisms.

1.2 Machine Learning

Analysis of big data derived from biological sources and subsequent
prediction of related features have been made possible by advances in
machine learning (ML). ML algorithms have been implemented for the
prediction of protein interactions, based on both sequence and
structure of proteins [13-15]. The inputs to these predictors are
observable quantities, which are analyzed to make statistical
predictions. In the case of protein interactions, these input
“features” are the sequence, secondary structure, motifs, domains,
genomic features such as gene context, and phylogeny; more recently,
network topology-derived features are used as well.

ML algorithms can also be classified into glass-box and black-box
models, depending on whether knowledge of the transformation of input
to output is available or not. Algorithms such as decision trees,
random forests, and support vector machines (SVMs) allow the
generation of explanations underlying the prediction mechanisms, while
artificial neural networks, called black-box models, do not allow such
explanations. Examples of such algorithms and their advantages in
protein interaction prediction will be discussed later in the chapter.

1.2.1 Evaluation of Machine Learning Models

The predictions of most algorithms, in this case, are binary (positive
or negative), i.e., the presence or absence of an interaction. A
prediction is either correct (a true positive or a true negative) or
incorrect (a false positive or a false negative). Hence, to quantify
the performance of the algorithms, multiple threshold-dependent
measures are employed, as summarized in Table 3. Depending on the
objective and the dataset available, one metric may be considered more
important than the others, and the algorithm is developed to improve
that score.




Table 3
Prediction model metrics

Metric               Expression                                   Definition
Accuracy             (TN + TP)/(TN + TP + FP + FN)                Fraction of correct predictions
Precision            TP/(TP + FP)                                 Positive predictive rate
Recall/sensitivity   TP/(TP + FN)                                 Fraction of correctly predicted positive samples
Specificity          TN/(TN + FP)                                 Fraction of correctly predicted negative samples
F1 score             (2*Precision*Recall)/(Precision + Recall)    Harmonic mean of precision and recall
AUC                  -                                            Area under the curve (true vs. false positive rate OR precision vs. recall)



A few parameters to measure algorithm prediction performance have been listed. The term TN stands for true negative, 
TP for true positive, FP for false positive, and FN for false negative 
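
The threshold-dependent metrics of Table 3 can be computed directly
from the four confusion-matrix counts, as in the minimal Python sketch
below. The function name and the counts are hypothetical; the sketch
assumes that every denominator is nonzero and does not cover AUC,
which requires ranked prediction scores rather than counts.

```python
def binary_metrics(tp, tn, fp, fn):
    """Metrics of Table 3 from confusion-matrix counts.

    Assumes all denominators are nonzero (no guard against empty classes).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical evaluation: 50 interacting and 50 non-interacting pairs
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```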


The aim of developing ML models is to gain the ability to correctly
predict novel interacting partners given a limited dataset, i.e., the
ML model should be able to generalize well on new data. The
generalizability of the model depends on the input dataset as well as
the complexity of the prediction algorithm. An overly complex model
would fit the training data well but fall short on new samples,
whereas a model that is too simple would not fit even the training
data well. These two extremes, termed overfitting and underfitting,
respectively, are assessed while measuring model performance. Testing
the prediction performance on an independent test set is required to
determine the robustness of the predictor. In the case of neural
networks, learning on the training dataset is generally followed by
evaluation on a validation dataset to reduce errors and then by
testing on an independent dataset to measure robustness.

1.3 Protein Interaction Databases

Thorough scrutiny of published literature and the increasing
reliability of computational predictions have allowed the creation of
databases of protein interactions. These databases serve as important
pools of information to build template-based models and machine
learning-based predictive models. Protein interaction information
obtained from various experimental and computational methods is
compiled in various online resources. These databases are generally
classified into two categories:

1.3.1 Primary Databases

Collected and curated manually, the protein interaction information
available in primary databases is derived from small- or large-scale
experimental procedures (Table 4).




Table 4
Databases for protein interaction studies and predictions

Database   Type        Description                                                                                       Reference
BioGRID    Primary     Biological General Repository for Interaction Datasets: PPI database for multiple model organisms   [213]
DIP        Secondary   Database of Interacting Proteins: curated PPI database for multiple organisms                     [214]
HINT       Secondary   High-quality INTeractomes: curated PPI database for multiple organisms                            [215]
HPRD       Primary     Human Protein Reference Database: database of PPIs from high-throughput experiments               [216, 217]
STRING     Secondary   Tool for obtaining functional enriched PPI networks for multiple model organisms                  [218]
mentha     Secondary   Public archival for PPI data                                                                      [219]
HIPPIE     Secondary   Human Integrated PPI rEference: tool to generate human PPI networks                               [220]
HuRI       Primary     Database of human binary protein interactions                                                     [9]
MINT       Primary     Database of PPIs based on literature                                                              [221]

Commonly used databases of protein interactions, with the type mentioned: primary indicates interactions derived and curated
from experiments, and secondary indicates that predictions are available as well


1.3.2 Secondary Databases

The protein interactions derived from experiments as well as high
confidence predictions from computational approaches are compiled
in databases termed secondary.

Since carrying out experiments for multiple types of proteins 
across organisms is challenging and limited by time, expertise, and 
cost, computational approaches based on protein interactions 
reported in the two categories of databases are used to develop 
prediction algorithms. Moreover, Table 5 compiles multiple 
resources that consolidate the already available PPI data in a user- 
friendly interface. 


Table 5
Available tools and databases for PPIs

Resource name                                  Description                                                                                                            URL
Human Integrated PPI Reference (Hippie)        Web tool to generate human PPI networks, integrating protein interaction networks with experiment-based quality scores   http://cbdm-01.zdv.uni-mainz.de/~mschaefer/hippie
Molecular INTeraction (MINT)                   Database of PPIs for multiple model organisms                                                                          https://mint.bio.uniroma2.it/
Human Protein Reference Database (HPRD)        Database of human PPIs from high-throughput experiments                                                                www.hprd.org
Dobson & Doig (D & D)                          Benchmark dataset of 1178 protein structures                                                                           https://graphlearning.io
Protein Interaction Network Analysis (PINA)    Database of PPIs for multiple model organisms                                                                          https://omics.bjcancer.org/pina
STRING                                         Database of PPIs and tool for obtaining functional enriched PPI networks for multiple model organisms                  https://string-db.org


2 Methods

2.1 Feature Extraction

Machine learning-based predictors of protein interactions are trained
on a set of feature vectors that attempt to define important
information of the proteins and the complexes. ML algorithms can be
adapted to discriminate between interacting and non-interacting
protein pairs based on specific factors that differ between a pair
that interacts and a pair that does not. A crucial step that
determines the performance of the ML model is feature extraction and
representation. Retaining all essential information intrinsic to the
protein and its interacting partner remains a hurdle in identifying
and predicting novel protein interactions. Various types of features
have been used in the recent past to describe proteins and predict
interactions (Fig. 1). The major categories are discussed as follows.


2.1.1 Sequence Features


Fig. 1 Encoding schemes for proteins. (a) A one-hot-based encoding scheme that allows protein representation
in the form of the amino acid sequence. (b) An encoding scheme that allows incorporating evolutionary
information. (c) Graph-based encoding that allows retaining sequential as well as spatial information


The primary structure of the protein is the linear sequence of amino
acids that form its building blocks [16]. The sequence of the protein
decides the structure, and hence utilizing the information encoded in
the sequence has been a preferred approach for both experimental and
computational studies [17]. Several schemes have been reported for
representing protein sequences in a machine-readable format, from
conventional one-hot encoding and k-mer encoding to more advanced
encoding schemes based on amino acid properties. For example, the 20
amino acids can be clustered into 7 classes based on the side-chain
size and charges: (AVG), (ILFP), (YMTS), (HNQW), (RK), (DE), and (C).
Each feature is then an amino acid triad representing the classes of
three consecutive residues, and each protein is described by a
frequency vector giving the number of times each triad occurs in its
sequence [18]. An improved version of this method involves clustering
the amino acids into six categories based on their biochemical
properties, (IVLM), (FYW), (HKR), (DE), (QNTP), and (ACGS), hence
redefining the relative frequencies of the amino acid triads. However,
these representations generate many zero-valued elements in the
feature vectors. Since zero-valued elements remain zero even after
scaling and normalization, not much information is captured, which
negatively affects the performance of the predictive algorithm. Hence,
counting dimer residues from position-specific scoring matrices was
also adopted [19]. Another approach grouped the amino acids into four
groups, depending on the chemical properties of the side chains:
(GAVLIMP), (STCNQ), (KRHED), and (FYW) [20]. In this scheme, termed
the RFAT system, proteins are represented as a 128-dimensional vector
with fewer zero elements.
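
The seven-class triad encoding described above can be sketched as
follows. This is a minimal illustration, not the implementation of
[18]: it assumes the seven-class grouping with isoleucine placed in
the (ILFP) class, returns raw triad counts without normalization, and
uses hypothetical names and a hypothetical toy sequence. Sequences
containing nonstandard residues would raise a KeyError here.

```python
from itertools import product

# Seven residue classes (assumption: isoleucine belongs with L, F, and P)
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
RESIDUE_TO_CLASS = {aa: i for i, group in enumerate(CLASSES) for aa in group}

# All 7 * 7 * 7 = 343 ordered triads of class labels
TRIADS = list(product(range(len(CLASSES)), repeat=3))
TRIAD_INDEX = {triad: i for i, triad in enumerate(TRIADS)}


def conjoint_triad_counts(sequence):
    """Count each class triad over all windows of three consecutive residues."""
    classes = [RESIDUE_TO_CLASS[aa] for aa in sequence]
    counts = [0] * len(TRIADS)
    for i in range(len(classes) - 2):
        counts[TRIAD_INDEX[tuple(classes[i:i + 3])]] += 1
    return counts


vector = conjoint_triad_counts("MKVLAAGDE")  # hypothetical toy sequence
print(len(vector), sum(vector))  # 343 features, 7 triads counted
```

Even for this toy sequence only 7 of the 343 entries are nonzero,
which illustrates the sparsity problem discussed above.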

More recent studies employ neural networks for the extraction 
of global and local sequence features that may be significant in 
interaction prediction. A recent study employs a Siamese recurrent 
convolutional network to capture the influence of protein 
sequences [21]. Stacked autoencoders allow capturing useful infor- 
mation from input data and reconstructing an output, generating 
robust features from protein sequence descriptors [22]. Multichan- 
nel input vectors are also reported to represent the different cate- 
gories of protein features, consisting of information from the 
protein-encoding matrix, the substitution scoring matrices, the 
physicochemical property matrix, and the residue contact energy 
matrix [23]. Some commonly used protein feature extraction 
methods are summarized in Table 6. 




Table 6
Sequence-based protein feature extraction methods

Name                                        Description                                                                                                              Descriptors
Amino acid composition                      Frequency of each amino acid type in a protein or peptide sequence                                                       20
Composition of k-spaced amino acid pairs    Frequency of amino acid pairs separated by any k residues                                                                Variable
Tripeptide composition                      The number of tripeptides represented by amino acid types r, s, and t                                                    8000
Dipeptide composition                       The number of dipeptides represented by amino acid types r and s                                                         400
Dipeptide deviation from expected mean      Calculated using dipeptide composition (Dc), theoretical mean (Tm), and theoretical variance (Tv)                        Variable
Grouped amino acid composition              Based on classes of amino acids according to their physicochemical properties                                            Variable
Binary                                      One-hot encoding                                                                                                         20 x n
Moran correlation                           Based on the distribution of amino acid properties along the sequence                                                    Variable
Geary correlation                           Determine if adjacent observations of the same phenomenon are correlated                                                 Variable
Normalized Moreau-Broto autocorrelation     Autocorrelation of a topological structure                                                                               21 x n
Composition/transition/distribution         Amino acid distribution patterns of a specific structural or physicochemical property in a protein or peptide sequence   13
Conjoint triad                              Properties of one amino acid and its vicinal amino acids by regarding any three continuous amino acids as a single unit   Variable
Sequence-order-coupling number              Based on distance matrix describing a distance between the two amino acids                                               Variable
Pseudo-amino acid composition               Based on hydrophobicity values, hydrophilicity values and the side-chain mass of the 20 natural amino acids              Variable
AAindex                                     Based on physicochemical properties of amino acids                                                                       544
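
Two of the simplest descriptors in Table 6, amino acid composition
(20 values) and dipeptide composition (400 values), can be sketched as
below. The function names and the toy sequence are hypothetical, and
the sketch assumes a sequence of at least two residues drawn only from
the 20 canonical amino acids.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence):
    """Fraction of each of the 20 amino acids in the sequence."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Fraction of each of the 400 dipeptides among overlapping residue pairs."""
    total = len(sequence) - 1
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        counts[sequence[i:i + 2]] += 1
    return [counts[dp] / total for dp in DIPEPTIDES]


aac = amino_acid_composition("ACDAC")  # hypothetical toy peptide
dpc = dipeptide_composition("ACDAC")
print(len(aac), len(dpc))  # 20 400
```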


2.1.2 Evolutionary Features


Comparison of a specific protein with similar sequences against a
reference database allows compiling an alignment, indicating the
probability of the occurrence of amino acids at each position. Since
there are 20 canonical amino acids, for each protein of length L, a
position-specific scoring matrix (PSSM) of dimensions L*20 could be
constructed [24]. Transforming protein sequences to PSSMs allows
homologous sequence information, which is informative of the
evolutionary past, to be included. Hence, PSSM-based features
incorporate not only sequence but also evolutionary features.
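
To make the L*20 shape concrete, the sketch below builds a
position-specific frequency matrix from a small alignment. This is a
deliberate simplification and an assumption of this example: PSSMs
produced by tools such as PSI-BLAST contain log-odds scores with
pseudocounts and sequence weighting, whereas the code here only
records raw per-column frequencies and ignores gaps. The alignment and
all names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_frequency_matrix(alignment):
    """Return an L x 20 matrix of per-position amino acid frequencies."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    matrix = []
    for pos in range(length):
        # Keep only canonical residues in this column (gaps are skipped)
        column = [row[pos] for row in alignment if row[pos] in AMINO_ACIDS]
        matrix.append(
            [column.count(aa) / len(column) if column else 0.0 for aa in AMINO_ACIDS]
        )
    return matrix


toy_alignment = ["MKV", "MRV", "MKI", "M-V"]  # hypothetical aligned homologs
profile = position_frequency_matrix(toy_alignment)
print(len(profile), len(profile[0]))  # 3 20
```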


PSSMs generated with PSI-BLAST at an e-value threshold of 0.001 have
been used to derive a featurization scheme termed pseudo-PSSM
(PsePSSM), which allowed predicting membrane interactions of proteins
[25]. In another approach, the L*20 matrix is considered as 20 blocks;
the resulting Block-PSSM features are then converted to a 1*400
feature vector, which was shown to improve the prediction of protein
function [26]. Implementing PSSMs for the prediction of protein
interactions offers two benefits: no special annotation is needed
that would be biased toward a specific subset of the proteomics data,
and, more importantly, evolutionary information vital to protein
interaction development is encoded. Generating bigram features from
PSSMs, coupled with the features derived from pseudo-amino acid
composition for proteins, allowed better prediction of drug target
interactions [27]. Coupling more protein-specific and context
information also allows for improvement in the performance of human
PPI prediction algorithms. An example can be found in a study that
combines features derived from posttranslational modification
information, codon usage, tissue information, and gene ontology.
Different classifiers were then trained and shown to be reliable in
predicting PPIs in humans [28]. Other similar studies, based solely on
PSSMs or incorporating additional features, have been reported to
predict protein interactions in various species [29-31].

2.1.3 Domain-Based Features

The binding specificity of a protein is determined by the structural 
features in the binding pocket of the domain. Domains are compact 
three-dimensional structures formed by conserved stretches of pro- 
tein sequences. Domains are capable of existing and functioning 
independently of the protein. Proteins may contain multiple 
domains. It has been shown that predictions based only on 
sequence features fall short on new data and that this limitation 
may be addressed by including domain information. An earlier 
report showed that the domain information included had a high 
predictive value [32]. Prediction of host-pathogen interactions was 
carried out by developing a novel framework that integrated pub- 
licly available intraspecies protein interaction information with their 
domain profiles [33]. The frequency of interaction of specific 
domain pairs was calculated from the dataset, and the probability 
of interaction of novel protein domain pairs was calculated. 
Another approach involved the integration of protein domain
information along with sequence features and other protein
properties to predict virus-host protein-protein interactions.
Implementing linear kernel SVMs, the predictor fared well when
trained on a combination of multiple types of features [34].
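
The domain-pair frequency idea can be sketched as follows. This is an
illustrative association-style estimate under stated assumptions, not
the framework of [33]: the probability for a domain pair is taken as
the fraction of labeled protein pairs containing it that interact, and
a protein pair is scored as 1 minus the product of (1 - p) over its
domain pairs, which assumes domain pairs act independently. All
protein names, domain annotations, and labels are hypothetical.

```python
from itertools import product


def domain_pairs(a, b, domains):
    """Unordered domain pairs formed between proteins a and b."""
    return {tuple(sorted(p)) for p in product(domains[a], domains[b])}


def domain_pair_probabilities(labeled_pairs, domains):
    """Fraction of labeled protein pairs with each domain pair that interact."""
    seen, interacting = {}, {}
    for (a, b), interacts in labeled_pairs:
        for pair in domain_pairs(a, b, domains):
            seen[pair] = seen.get(pair, 0) + 1
            if interacts:
                interacting[pair] = interacting.get(pair, 0) + 1
    return {pair: interacting.get(pair, 0) / n for pair, n in seen.items()}


def interaction_score(a, b, domains, probabilities):
    """1 - prod(1 - p) over domain pairs; unseen domain pairs contribute 0."""
    no_contact = 1.0
    for pair in domain_pairs(a, b, domains):
        no_contact *= 1.0 - probabilities.get(pair, 0.0)
    return 1.0 - no_contact


# Hypothetical domain annotations and labeled training pairs
domains = {"A": ["SH3"], "B": ["PRM"], "C": ["SH3", "PDZ"],
           "D": ["PRM"], "E": ["KIN"], "F": ["PDZ"]}
labeled = [(("A", "B"), True), (("C", "D"), True), (("A", "D"), False),
           (("F", "B"), False), (("C", "E"), False)]

probabilities = domain_pair_probabilities(labeled, domains)
print(round(interaction_score("C", "B", domains, probabilities), 3))
```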

Other methods that use domain structure information include an
empirical force field to calculate energy functions for human domain
interactions [35] and the construction of position-weighted matrices
(PWMs) of all possible SH3 protein-ligand complexes using homology
modeling [36]. An SVM-based predictor built on domain structure and
sequence information of various PDZ interactions was also trained and
tested on multiple organisms for scanning the proteome [37].

2.1.4 Motif-Based Features

The three-dimensional organization of conserved protein sequences is
called a motif, which, unlike domains, cannot retain structure and
function independently of the protein [38]. Short linear motifs (LMs
or SLiMs) have been shown to play a crucial role as mediators in
protein interactions [39, 40]. SLiMs are generally two to eight amino
acid residues in length and can directly interact with protein
structures in the same or other proteins. Motif features and
motif-motif interaction features may be exploited for the prediction
of protein interactions. However, interactions among such smaller
motifs differ from those of domains: the smaller interaction surfaces
lead to smaller binding energies and weaker affinities, yet these
interactions are important in protein-protein interactions that
respond to cellular environments [41].

An early study attempted to predict motifs from multiple sequence
alignments of HIV proteins, incorporating this information to generate
a prediction model to estimate HIV-human protein-protein interactions
[42]. Yet another older report suggested using motif information
derived from sequences from the eMotif database, along with other
sequence information, to predict protein interactions reliably using
the kernel method [43]. Conserved signatures of varying lengths
derived from PROSITE, used in a bag-of-features approach that does not
retain the order of the information, have recently been used to train
a confidence-rated boosting algorithm to predict drug-protein
interactions [44, 45]. A method that utilizes motif-domain
interactions to predict virus-host PPIs was also presented and
obtained better results than previous reports [46]. Motif surface
accessibility was included to filter the predicted virus-host PPIs to
address the issue of false positives [47].

2.1.5 Other Structural Features

The three-dimensional arrangement of atoms of the individual 
residues makes up the protein structure. Deciphering the molecular 
functions of a protein often involves examining its structures. In a 
set of proteins and their interacting partners, similar structures tend 
to have similar interactions. Multiple studies have utilized the 
structural features to derive possible interacting partners [48- 
51]. Structural features could include coordinates, electrostatic 
properties, and surface area. Publicly accessible databases like Uni- 
Prot and PDB tend to be essential resources for obtaining sequence 
and structure information. Recently, representing protein
structures as attributed graphs with residues as nodes and the bonds as
edges has also been shown to be a viable option for training predictive
models [52, 53]. However, approaches based on the 3D structures
of biomolecules are limited by the paucity of high-resolution struc- 
tures and experimental benchmarks. 


2.1.6 Network Topology-Based Features

Protein interaction networks are also predicted based on the
knowledge of existing networks. Such methods learn the topological
connections within a network and predict PPIs without being dependent
on data from biological sources. Missing interactions are predicted
based on the interactions already established. Link prediction
algorithms borrow from social network analysis [7]. Other methods to
predict the structure of a network include “network path of length
3,” intrinsic geometry structure, common neighbor, collaborative
filtering-enhanced topology, and random walk-based diffusion
propagation [54-58].

2.1.7 Feature Extraction and Encoding of Other Binding Partners: DNA, RNA, and Small Molecules

Nucleic Acids: DNA and RNA

Many cellular processes are governed by protein-nucleic acid (NA)
interactions that drive transcription, replication, reverse
transcription, posttranscriptional processing, and transport of RNA,
as well as translation and degradation of mRNA. Dysregulation in
protein-NA interactions leads to various diseases [59-61]. DNA- and
RNA-binding proteins, hence, form a crucial but heterogeneous group of
macromolecules. Determining and modulating protein-NA interactions
depends on prior knowledge of structure, which is limited by the
experimental determination of complexes, a slow and intensive process
[62, 63].

Multiple prediction algorithms have been based on various 
approaches to obtain effective features, encompassing most infor- 
mation from biological data (Fig. 2). The more common encoding 
methods have been listed as follows: 

One-hot encoding is the most common encoding scheme for
DNA and RNA sequences. Each of the four bases is encoded as 1 or
0 based on its presence at a particular location, resulting in an L*4
matrix, where L is the length of the sequence.
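
A minimal Python sketch of this one-hot scheme is given below. The
function name, the column order A, C, G, U, and the example sequence
are assumptions of this illustration; for DNA the alphabet argument
would be "ACGT".

```python
def one_hot(sequence, alphabet="ACGU"):
    """Encode a nucleotide sequence as an L x 4 matrix of 0s and 1s."""
    index = {base: i for i, base in enumerate(alphabet)}
    matrix = []
    for base in sequence:
        row = [0] * len(alphabet)
        row[index[base]] = 1  # mark the channel of the base at this position
        matrix.append(row)
    return matrix


encoded = one_hot("ACGUA")  # hypothetical short RNA
for base, row in zip("ACGUA", encoded):
    print(base, row)
```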

Since one-hot encoding results in a low-dimensional feature 
vector, it may be insufficient to retain sequence context informa- 
tion. Hence, extended one-hot encoding methods have been pro- 
posed [64]. A stacked codon-based encoding scheme maps three 
consecutive nucleotides to a pseudo-amino acid. The codon to 
residue map used is standard; however, it is conducted in an over- 
lapping manner due to the uncertainty of the starting site. This 
representation is then converted to a one-hot matrix of L*21 
(20 residues+1 stop codon). 
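
The stacked codon scheme can be sketched as follows, assuming the
standard genetic code and a window that slides one nucleotide at a
time. Under these assumptions a sequence of length L yields L - 2
overlapping codons, so this sketch produces an (L - 2) x 21 matrix;
any padding needed to reach exactly L rows is not shown. The function
and variable names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
SYMBOLS = "ACDEFGHIKLMNPQRSTVWY*"  # 20 residues + 1 stop symbol


def stacked_codon_encoding(sequence):
    """Translate every overlapping triplet, then one-hot encode the residues."""
    dna = sequence.upper().replace("U", "T")
    residues = [CODON_TABLE[dna[i:i + 3]] for i in range(len(dna) - 2)]
    matrix = [[int(symbol == r) for symbol in SYMBOLS] for r in residues]
    return residues, matrix


residues, matrix = stacked_codon_encoding("AUGGCUUAA")  # hypothetical RNA
print("".join(residues))  # MWGALL*
```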

K-mer encoding is another method to convert biological
sequence data to a machine-readable format. RNA or DNA
sequences can be transformed using the k-mers sparse matrix
method [65]. Each sequence, built from four letters (ACTG in
DNA and ACUG in RNA), is read with a window that slides one
nucleotide at a time. Each unit is “k” nucleotides long, and for a
sequence of length “L,” the total number of k-mers is L-k + 1.
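
As a minimal sketch of k-mer extraction, the Python snippet below
slides a window of width k over a sequence and counts the resulting
words. The function name and the toy sequence are hypothetical.

```python
from collections import Counter


def kmers(sequence, k):
    """All overlapping k-mers, read with a stride of one nucleotide."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


words = kmers("ACGUACG", 3)  # L = 7, k = 3, so 7 - 3 + 1 = 5 k-mers
print(words)                 # ['ACG', 'CGU', 'GUA', 'UAC', 'ACG']
print(Counter(words))        # 'ACG' occurs twice
```

The counts could then be placed in a vector indexed by all 4^k
possible k-mers, most of whose entries are zero for short sequences.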

While k-mer encoding is a discrete representation of features,
the correlation among k-mers cannot be retained. Hence, the need
for a continuous distributed representation arises. DNA and RNA
are treated as a language, with k-mers as words and sequences as
sentences. Inspired by the recent developments in natural language
processing, word embedding methods are adapted to suit
biological data. For example, word2vec and GloVe embeddings
allow learning of continuous-valued vectors for k-mers [66, 67].


Fig. 2 Encoding schemes for RNA or DNA sequences. A representative short RNA sequence is depicted as an
example. (a) One-hot encoding, where four channels are present each for the four different nucleotides. The
presence of each nucleotide is noted across the sequence. (b) k-mer encoding, as an example 3-mer
encoding is shown. (c) Continuous distributed representations for various sequences can be derived using
algorithms such as Word2vec. (d) Stacked codon encoding slides a three-base window over the sequence and
predicts the amino acid for the triplet. The amino acid sequence is then one-hot encoded


Small Molecules



The binding of small molecules, or drugs, to a biological target
induces a change in behavior or function, which leads to changes in
physiology. The target could be proteins or nucleic acids. Inferred
by experimental studies of pharmacology or reverse pharmacology,
establishing drug target interactions is a time-consuming as well as
costly process [68, 69]. Hence, there is a need for reliable
computational methods to predict drug target interactions, reducing
the research space to be covered in the laboratories [70]. Over the
past few decades, the number of compounds being synthesized has
increased rapidly. Yet, their possible target profiles and effects
remain largely unidentified. Additionally, there are still a large
number of diseases that warrant potential cures or at least drugs to
manage the symptoms. Since information is already available on
multiple drugs and their biological targets, there is a need to
utilize this high-dimensional data to build predictive algorithms for
determining novel drug target interactions that could be potentially
beneficial. Prediction of such interactions will not only help
discover new drugs but also allow drug repurposing as well as
determination of potential side effects [71-73].


Fig. 3 Encoding schemes for small molecules. An example of the drug 3,4-methylenedioxymethamphetamine
is depicted. (a) String-based methods translate the chemical structure to a string that attempts at retaining
structural information. For example, simplified molecular-input line-entry system (SMILES) focuses on
localized substructures, while self-referencing embedded strings (SELFIES) guarantee valid molecular
structures. (b) Chemical fingerprints encode the structure into a binary vector based on the substructures present.
(c) Graph-based embeddings transform the molecule into a series of nodes and edges depicted by adjacency
and node features


Computational methods of small-molecule interaction prediction
follow three approaches. The ligand-based approach assumes that
small molecules interacting with a protein will be structurally
similar to its natural ligands. The docking-based approach uses the
3D structures of the protein and the drug to predict binding. The
third, chemogenomic approach extracts and uses the information of the
drug and the target simultaneously and includes both feature-based
methods and similarity-based methods [74]. A major advantage
of this approach is that it allows utilization of the extensive
biological data available across various online platforms and public
databases. Feature-based methods revolve around discovering and
implementing discriminative factors of the drug, target, and
interaction interface. Hence, accurate representation of the features
that serve as input to the prediction models is essential. Common
data formats and encoding schemes are discussed in this section
(Fig. 3):




(i) String 


Small-molecule structures can be converted to human-readable as
well as machine-readable forms by various methods, the most widely
adopted being conversion to strings. Among the most widely used
formats for representing chemical compounds as strings is the
simplified molecular-input line-entry system (SMILES) [75]. The
compound is described starting from one atom and visiting all the
others, with rings opened by breaking one bond in each. The
line entry is extended by specific rules for each atom, bond, cycle,
branch, and stereochemical property. Other string-based
representations have also been developed to better describe
substructures or constraints, such as SMARTS and SELFIES [76, 77].

Encoding the SMILES or other representations into words or 
numbers allows training prediction algorithms on the drug features 
derived from the structure. An example of encoding as a mix of 
one-hot and multi-hot vectors is normalizing the number of 
valence electrons and encoding chirality and aromaticity for each 
atom [78]. In many studies, the SMILES representation is directly 
input as a vector to allow the neural networks to extract relevant 
features [79]. Mapping characters to real number vectors allows 
word-like embedding of SMILES, which may be achieved by 
Word2vec. Along with sequential networks like RNN or LSTM, 
these “word” embeddings also serve as powerful representations 
[80, 81]. 
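
Before a SMILES string can be fed to an embedding layer or a
sequential network, it is typically split into tokens and mapped to
integers. The sketch below shows one simple way to do this; the
tokenization pattern (two-letter halogens, bracketed atoms, otherwise
single characters), the use of 0 for padding, and all names are
assumptions made for this illustration rather than the procedure of
any cited study.

```python
import re

# Two-letter halogens and bracketed atoms are kept as single tokens
TOKEN_PATTERN = re.compile(r"Cl|Br|\[[^\]]+\]|.")


def tokenize(smiles):
    """Split a SMILES string into atom, bond, branch, and ring tokens."""
    return TOKEN_PATTERN.findall(smiles)


def encode(smiles_list):
    """Map tokens to integer ids and right-pad with 0 to a common length."""
    vocabulary = {}
    encoded = []
    for smiles in smiles_list:
        # Ids start at 1 because 0 is reserved for padding
        ids = [vocabulary.setdefault(tok, len(vocabulary) + 1)
               for tok in tokenize(smiles)]
        encoded.append(ids)
    width = max(len(ids) for ids in encoded)
    return [ids + [0] * (width - len(ids)) for ids in encoded], vocabulary


# First entry: a SMILES string for 3,4-methylenedioxymethamphetamine
molecules = ["CNC(C)CC1=CC=C2C(=C1)OCO2", "CCO", "ClCCl"]
padded, vocabulary = encode(molecules)
print(tokenize("ClCCl"))  # ['Cl', 'C', 'Cl']
```

The resulting integer matrix could be passed to an embedding layer, or
each id could be expanded into a one-hot vector.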


(ii) Fingerprint 


Constitutive scaffolds and certain functional groups occur 
commonly in chemical compounds. These can be used to define 
chemical fingerprints to describe small molecules as a simpler rep- 
resentation of their complex structures [82]. The several ways to 
extract drug fingerprints can be categorized into either topology- 
based or SMARTS-based schemes. The topology-based finger- 
printing schemes include information of the bonds and atoms 
after calculating distances in the molecules, e.g., Morgan, ECFP, 
and 2D pharmacophore. The SMARTS-based fingerprinting
considers the bond orders and aromaticity based on the SMARTS
profile; PubChem and MACCS are examples of such
fingerprinting algorithms [83, 84].


(iii) Graph-Based 


Weave or graph-neural fingerprints of drugs have been shown 
to be successful in capturing the chemical properties of the com- 
pounds in multiple recent studies. The molecules are converted to 
graph adjacency matrices with atom and bond information. These 
matrices, as inputs to graph convolution networks (GCN), are 
useful in generating the context of nodes. GCNs are of two sub- 
types — spectral and spatial GCNs. Spectral GCNs consider the 
entire graph, while spatial GCNs only consider local subgraphs 
[85, 86]. 
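
The graph representation can be sketched without any chemistry
toolkit by writing the atoms and bonds of a small molecule by hand.
The example below builds the adjacency matrix and a one-hot node
feature matrix for the heavy atoms of ethanol and then performs one
round of neighbor aggregation. This aggregation is only a simplified
stand-in for a spatial graph convolution (no learned weights,
normalization, or bond features), and all names are hypothetical.

```python
def build_graph(atoms, bonds, elements=("C", "N", "O")):
    """Return (adjacency matrix, one-hot node feature matrix)."""
    n = len(atoms)
    adjacency = [[0] * n for _ in range(n)]
    for i, j in bonds:
        adjacency[i][j] = adjacency[j][i] = 1  # undirected bond
    features = [[int(atom == element) for element in elements] for atom in atoms]
    return adjacency, features


def aggregate(adjacency, features):
    """Add each atom's own features to those of its bonded neighbors."""
    updated = []
    for i, row in enumerate(features):
        total = list(row)
        for j, bonded in enumerate(adjacency[i]):
            if bonded:
                total = [x + y for x, y in zip(total, features[j])]
        updated.append(total)
    return updated


# Heavy atoms of ethanol (C-C-O); hydrogens are left implicit
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
adjacency, features = build_graph(atoms, bonds)
context = aggregate(adjacency, features)
print(context)  # [[2, 0, 0], [2, 0, 1], [1, 0, 1]]
```

After one aggregation step, each node vector already reflects its
local chemical neighborhood, which is the intuition behind generating
the context of nodes.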


2.2 Applications 


Prediction of Protein Interactions 301 


Additionally, the fusion of multiple features derived from proteins
has been an area of active research, and multiple methods of
feature extraction have been developed to improve the information
derived. Treating a protein sequence as a set of signals allowed
the implementation of an autocovariance-based encoding scheme,
which considers both the positional amino acid composition and the
physicochemical properties of the protein [87]. Initially introduced
in the field of theoretical physics, the resonant recognition model
(RRM) assigns amino acids a set of physicochemical properties and
then encodes them as numerical sequences. The degree of correlation
between these parameters and protein activity or binding energy
allows the RRM to be used for protein analysis [88]. Various other
encoding and representation schemes have been proposed and
implemented, some of which have been reviewed in a recent
publication [89].
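An autocovariance-based encoding of the kind mentioned above can be sketched as follows. This is a minimal illustration that assumes a single physicochemical property (the Kyte-Doolittle hydropathy scale, chosen here only as an example) and the commonly used lag-based autocovariance formula; published schemes typically combine several normalized properties, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values, used here only as an example property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def auto_covariance(sequence, scale, max_lag):
    """Return one autocovariance value per lag (1..max_lag).

    AC(lag) = 1/(n - lag) * sum_i (x_i - mean) * (x_{i+lag} - mean),
    where x_i is the property value of residue i and n is the length.
    """
    values = [scale[aa] for aa in sequence]
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


# Sequences of different lengths map to vectors of the same length.
short_vec = auto_covariance("MKVLAAGIV", HYDROPATHY, max_lag=3)
long_vec = auto_covariance("MKVLAAGIVGLSAQELLRK", HYDROPATHY, max_lag=3)
```

The fixed output length is the practical appeal of this encoding: proteins of any length become comparable feature vectors, while the lag captures positional dependence between residues.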

The identification of protein interactions is essential to the 
study of biological networks, yet their prediction and identification 
are error-prone. The limitations of features based only on sequence
or structure could be overcome by incorporating high-throughput
biological data including, but not limited to, microarrays,
next-generation RNA sequencing reads, and expressed sequence tags
[90-92]. Including gene expression profile features would allow
identification of gene products
that change expression together with some other factor, regulated 
by mechanisms that can be unraveled by such studies. However, the 
data used as features should be obtained using standard experimen- 
tal techniques and standard pipelines for data analysis to ensure the 
robustness of the algorithm. Since interrelationships between pro- 
tein interaction and gene expression profiles may be hard to eluci- 
date, statistical measures of correlation and setting significance 
thresholds are implemented before integration into training data 
for prediction algorithms. 
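The correlation-and-threshold step can be sketched as follows. This is a minimal illustration, assuming Pearson correlation as the statistical measure and a fixed cutoff on the absolute correlation in place of a formal significance test; the gene names, expression values, and the 0.9 cutoff are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sd_x * sd_y)


def coexpressed_pairs(profiles, threshold=0.9):
    """Return gene pairs whose |r| meets the threshold, with their r."""
    genes = sorted(profiles)
    pairs = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(profiles[g1], profiles[g2])
            if abs(r) >= threshold:
                pairs.append((g1, g2, r))
    return pairs


# Hypothetical expression values across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
pairs = coexpressed_pairs(profiles, threshold=0.9)
```

Only pairs that pass the cutoff would be carried forward as features for a prediction algorithm. In practice the cutoff would be derived from a significance test with multiple-testing correction rather than fixed by hand.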


Understanding the relationships among various biological entities is
as vital as the mere knowledge of their existence in formalizing
many biological processes. For instance, cell differentiation is
dependent on both the types of proteins present in the system 
and their associations. High-throughput technology has made 
biological network studies possible and allowed for progress in 
open problems related to drug target discovery and pathway analy- 
sis. Further, in the era of “big data,” extracting knowledge from the 
enormous amount of data has become a vital part of most domains, 
including biology and bioinformatics [93]. Machine learning 
(ML) has proven to be an efficient tool to discover underlying 
patterns in biological networks, build models, and make a predic- 
tion based on the most robust model. ML algorithms, including 
Bayesian networks, random forests, support vector machines, and 
hidden Markov models, have extensively been used in genomics, 
proteomics, and systems biology [94]. 


302 Dhvani Sandip Vora et al. 


2.2.1 PPI Networks to 
Understand Disease 


2.2.2 Protein Function 
Prediction 


Recently, deep learning (DL)-based solutions have seen
unprecedented applications in diverse fields like machine vision,
signal processing, natural language processing, and computational
biology due to their ability to model complex data. For instance,
IBM’s Watson and Google DeepMind’s AlphaFold have achieved great
success toward solving critical problems in clinical oncology and
protein folding, respectively [95, 96]. The biggest advantage of
DL-based methods is that a problem is solved by passing input
signals through a network that learns to recognize intricate
patterns. Artificial neural networks (ANNs), which are the
fundamental building blocks of most deep learning architectures,
are loosely modeled on the working of neurons in the human brain.
DL can combine simpler features and learn complex substructures in
data. In other words, with the presence of nonlinearity in stacked
layers of a DL architecture, data can be hierarchically represented
with an increasing level of abstraction.


Most of our current knowledge of the etiology of various diseases 
comes from approaches aiming to uncover their genetic basis. The 
ability to generate individual genome data with next-generation 
sequencing methods promises to change the field of translational 
bioinformatics even more. To understand the biological intricacies
of pathogenesis and disease progression, it is necessary to
identify the molecules and mechanisms that trigger, participate in,
and control perturbed biological processes. Deciphering such
molecular mechanisms leading to diseased states is an
even bigger challenge than elucidating the genetic basis of complex 
diseases. Even when the genetic basis of a disease is well under- 
stood, not much is known about the molecular details leading to 
the disorders. 


Functional annotation of proteins plays a crucial role in identifying 
disease-causing aberrations in genes or proteins, understanding 
cellular mechanisms, and developing tools for prevention, diagno- 
sis, and treatment of disease. Complex relationships among geno- 
type and phenotype have guided the analysis of genome-wide 
molecular interaction data. Multiple databases have curated and 
integrated such heterogeneous data at varied extents of biological 
complexity [97]. As opposed to manual curation, other methods 
try to extend the primary data with predictions and indirect associ- 
ation to estimate a bigger picture of the biological process [98- 
100]. Similarly, functional annotation of a newly sequenced protein 
is performed using homology mapping or by identifying functional 
domains from preexisting databases. BLAST, FAST, Pfam, Pro- 
Dom, and SCOP are some of the commonly used homology- 
based methods [24, 101-103]. Such models are often guided by 
the “guilt by association” principle that works under the assump- 
tion that adjacent nodes in a network have more functional similar- 
ity in comparison to farther nodes. 
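The "guilt by association" principle can be sketched as a majority vote over the annotated neighbors of a protein. This is a minimal illustration, assuming an unweighted network and a single function label per protein; the protein identifiers and function labels are hypothetical.

```python
from collections import Counter


def predict_function(protein, interactions, annotations):
    """Predict a function label from the direct neighbors of a protein.

    Returns the most frequent label among annotated neighbors, breaking
    ties alphabetically, or None when no annotated neighbor exists.
    """
    neighbors = set()
    for a, b in interactions:
        if a == protein:
            neighbors.add(b)
        elif b == protein:
            neighbors.add(a)
    votes = Counter(annotations[n] for n in neighbors if n in annotations)
    if not votes:
        return None
    return min(votes, key=lambda label: (-votes[label], label))


# Hypothetical interaction list and known annotations.
interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P4", "P5")]
annotations = {"P2": "kinase", "P3": "kinase", "P4": "transport"}

predicted = predict_function("P1", interactions, annotations)
```

Here the unannotated protein P1 inherits the label shared by most of its interaction partners. Extensions of this idea weight the votes by network distance or interaction confidence.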


Table 7
Application of deep neural networks, recurrent neural networks, and emergent architectures in tasks
involving biological datasets

                            Omics                                    Signal processing
                            Research topic              Reference    Research topic           Reference
Deep neural networks        Protein structure           [222-224]    Brain decoding           [225]
                            Gene expression regulation  [226-228]    Anomaly classification   [229]
                            Protein classification      [230]
                            Anomaly classification      [231]
Recurrent neural networks   Protein structure           [232]        Brain decoding           [233]
                            Gene expression regulation  [234]        Anomaly classification   [235]
                            Protein classification      [236]
Emergent architectures      Protein structure           [237]        Brain decoding           [235]


2.2.3 Protein-Drug
Interaction Site Prediction
Using PPIs


2.3 Template-Based 
Methods of Protein 
Interaction Prediction 


PPI-based computational methods to determine protein func- 
tion are limited due to the lack of uniformity in the network 
topology. While contrasting functions arise from different gene 
sets, the prediction accuracy is largely affected by the number of 
neighbors or choice of distance metric. DNNs have been exten- 
sively used for the prediction of protein function. 


Targeted therapies greatly benefit from the functional discovery of
candidate disease-related genes. Computational methods based on
PPI profiles are extensively summarized in the literature
[104, 105]. One such implementation involves network construction
aided by gene expression profiles to identify critical nodes in a
biological network. The primary goal of any method involving PPIs
is to identify critical nodes and their neighbors in the network as
therapeutic targets, based on the rationale that protein interactions
play an important role in systematic aberrations. This led to the
notion of “guilt by association,” which assumes that entities related
to a known disease-causing agent (gene/protein) are likely to be
involved in the disease. Some typical applications are listed in
Table 7.


Recent advances in sequencing have allowed the generation of a 
wealth of protein data. Integrating and extracting relevant data 
from such diverse and extensive sources demand computational 
methods. Besides machine learning-based methods, protein
interactions can be predicted by various other methods, e.g.,
interolog identification, gene coexpression, and gene cluster
analysis. Some of these methods are described below.

As protein complexes are increasingly purified and the struc- 
tural data is made available, template-based predictions of protein 
complexes are attracting attention. The interactions between pro- 
teins and networks are modeled based on the similarity of sequence 




2.3.1 Homology-Based 
Approaches 


or structure of other protein complexes, also known as the tem- 
plates. The method involves identifying a nonredundant dataset of 
templates for the protein complexes to predict and then evaluating 
the predictions based on some scoring function. The scoring
function is generally a statistical potential or the energy of the complex.
Template-based methods tend to be more efficient than docking, 
especially at a proteome scale, helping limit the number of possible 
favorable conformations [106, 107]. 

Assembling an inclusive but nonredundant dataset of templates 
is crucial to template-based methods — templates, if not correctly 
selected, could lead to false positives or false negatives. A limitation 
of this approach is the unavailability of similar templates for pro- 
teins. Complex structures and novel interactions cannot be pre- 
dicted in the absence of templates. However, when suitable 
templates are available, the prediction algorithms are reliable and 
fast. The increasing number of protein and protein complex struc- 
tures being deposited in PDB promises template-based algorithms 
will continue gaining attention. Broadly, template-based methods 
can be divided into two major categories: homology-based and 
interface-based algorithms. 


Template-based prediction methods are based on the finding that 
proteins with an identity of at least 30-40% associate similarly 
[108]. However, exceptions to the findings also exist [109]. Pre- 
dicting interactors based on sequence homologs scored on the basis 
of empirical potentials derived from experimentally established 
interactions has been used widely. Knowledge-based potentials are 
easy to implement and have been shown to be successful in
predicting interologs, i.e., interaction homologs [110, 111].
InterPreTS, a web server designed on a similar approach that uses
Blast2 as its homolog search algorithm, predicts protein
interactions based on this knowledge-based scoring method [112].
GWIDD, a database of experimental and template-based interaction
models for various species, consists of structural representations
of multiple genomes [113].

An advantage of homology-based methods is that the bound 
state of even unstructured proteins can be predicted by comparing 
with similar proteins. Another approach, the WSsas method, which
maps queried protein sequences to known structures on the basis of
the functional residues of homologous proteins, has also been shown
to be promising [114]. Distinct members of the same family of
domains are also known to associate in a similar manner. Hence,
domain information is incorporated into the prediction of
protein complexes [115]. The matched domains scored
by statistical potentials derived from side-chain contacts are shown 
to distinguish non-native contacts accurately [116]. Scoring and 
discrimination based on dynamics of interface residues have also 
been applied [117]. Additional methods reported to predict 


2.3.2  Interface-Based 
Methods 


2.3.3 Gene-Based 
Methods 




structural interaction based on homology include machine learning 
techniques that combine geometric, physicochemical, and similar- 
ity information, covered in detail in sections that follow. 


Although the structure of a protein tends to be more conserved 
than its sequence, interface residues are evolutionarily more con- 
served than the structure [118]. Hence, it is also possible that 
entirely different protein pairs may share similar interaction inter- 
face frameworks [119]. Multiple such reports prompted the idea 
that implementing information derived from the interface regions 
alone, independent of the sequence and global structure, could 
predict protein complex interactions adequately. Homology-based 
methods fall short when protein sequence similarity is low. How- 
ever, interface-based methods for prediction are sequence- 
independent. 

The first algorithm to implement a surface-based prediction 
was PRISM [120]. PRISM combines evolutionary information and
geometric complementarity, allowing target proteins to be predicted
by homologous spatial motif search. In a later study, sequence
homology and global fold parameters were found to be of lesser
importance than local structural alignments [121]. It was
also reported that the conservation at the interface is useful for 
predicting interactions for even evolutionarily remote proteins. 
PredUs implements this knowledge for protein binding prediction 
in a diverse structural dataset [122]. 


Proteins that are likely to interact may be identified by studying 
gene coexpression data. Clustering algorithms can group together 
genes with similar expression profiles. Members of such a cluster
may be considered as functional association candidates and even
physical binding partners. Such candidates can be validated by
checking multiple time points or states [123]. However, a drawback
of this method is that high-throughput expression data can be
noisy. Protein levels also do not perfectly correlate with gene
expression levels, which can yield misleading interaction
information.

A group of genes located within a set intergenic distance of one
another is called a gene cluster. Ranging from a few to more than a
hundred genes, clusters house genes that have related functions and
are potential interactors. In bacteria, genes housed in operons are
transcribed together. In eukaryotes, gene clusters are coregulated.
It has been observed that genes involved in the same cellular
pathway are often present in close proximity [124]. Conversely, if
the genes are not in proximity in the genome, this approach cannot
identify interactions. Multiple resources exist to implement this
method [125, 126]. Gene cluster analysis is a simple approach, but
it indicates functional rather than physical interactions and
depends heavily on the number of genomes used as reference.




2.3.4 Network Topology- 
Based Approaches 


2.4 Learning-Based 
Methods to Identify 
Protein Interactions 


2.4.1 Machine Learning- 
Based Methods 


Protein interaction networks are now commonly represented as
graphs: proteins form the nodes, and the associations between
proteins form the edges. Topological features extracted
from protein interaction graphs would indicate the number of 
direct or indirect neighbors, shortest paths, etc. “Hubs” in such 
graphs are a small number of proteins that have multiple interaction 
partners. These hubs serve as centers of function and integrity of 
cellular processes. A mathematical representation of such protein 
interaction networks allows the identification of functional rela- 
tions and novel interactions. If proteins have multiple common 
interactors in the network, they can be assumed to be a part of 
similar processes [127, 128]. For proteins with shared interaction 
partners, the structure, sequence, or biochemistry can be assumed 
to be similar. Protein interactions can be predicted by just the 
topological features independent of prior knowledge of sequence 
or structure [129]. Integration of protein sequence and function 
information would also allow better prediction of protein com- 
plexes [130]. Detection of conserved interactions and predicting 
novel interactions could also be achieved by alignment of various 
protein interaction networks, which has been reviewed in detail 
elsewhere [131]. 
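The topological ideas above (hubs and shared interaction partners) can be sketched as follows. This is a minimal illustration, assuming an undirected, unweighted network and the Jaccard index of neighbor sets as the score for candidate interactions; the scoring choice, the degree cutoff, and the protein identifiers are assumptions made for the example.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected graph as a dict of neighbor sets."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def hubs(graph, min_degree):
    """Proteins with at least min_degree interaction partners."""
    return sorted(p for p, nbrs in graph.items() if len(nbrs) >= min_degree)


def jaccard_candidates(graph):
    """Score non-interacting pairs by the overlap of their neighbors.

    Returns (score, protein_a, protein_b) tuples, highest score first.
    """
    scores = []
    for a, b in combinations(sorted(graph), 2):
        if b in graph[a]:
            continue  # already a known interaction
        shared = graph[a] & graph[b]
        if shared:
            union = graph[a] | graph[b]
            scores.append((len(shared) / len(union), a, b))
    return sorted(scores, key=lambda s: (-s[0], s[1], s[2]))


# Hypothetical interaction network.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P5", "P2"), ("P5", "P3")]
graph = build_graph(edges)
hub_proteins = hubs(graph, min_degree=3)
candidates = jaccard_candidates(graph)
```

In this toy network, P2 and P3 share all of their partners and therefore rank first as a candidate pair, using nothing but the topology of the graph.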


It has been observed that in silico predictions of PPIs achieve
accuracy similar to that of large-scale experimental PPI datasets.
Furthermore, machine learning algorithms that are quick and
scalable can improve the efficiency of experimental methods when
used in tandem [132]. Machine learning techniques used for
predicting PPIs can be broadly classified into two categories,
supervised and unsupervised, based on whether the input variables
need to be labeled according to the expected outcome or not. In
general, supervised learning infers a mapping function from given 
input-output pairs that can be used to train a model for predicting 
outcomes for other inputs. On the other hand, unsupervised 
learning discovers the hidden structure within unlabeled training 
data for drawing meaningful inferences. 

Artificial neural networks (ANNs), support vector machines 
(SVMs), Bayesian inference, and decision tree-based methods 
such as random forest (RF) are some of the supervised learning 
algorithms that are used for predicting PPIs [133]. Supervised 
machine learning is implemented for classification problems, i.e., 
segregating input data points into specific classes, where a set of 
quantitative or categorical features are analyzed for features that are 
capable of discriminating given input variables into specified classes. 
Figure 4 represents a schematic describing the various types of ML 
algorithms and their general use cases. Clustering techniques fall 
under unsupervised learning, where methods such as k-means, 
single-linkage, and spectral clustering are used for PPI prediction. 


[Figure 4: schematic. Traditional programming: data and algorithm
are given to the machine, which produces the output. Machine
learning: data and output are given to the machine, which produces
the algorithm. Unsupervised learning (clustering): group/cluster data
based on inputs. Supervised learning (classification, regression):
develop predictive model based on inputs and outputs]


Fig. 4 Machine learning and its derivatives. A schematic highlighting the fundamental difference between
traditional programming and machine learning. Broad categories of ML, based on the algorithm and problem,
are also compiled


PPI prediction, which is generally a binary classification task, 
has two categories: the “positive” (p) class, containing proteins that 
interact with each other, and the “negative” (n) class, containing 
proteins that do not interact. A given instance or data point is 
classified as “positive” if the computed score (represented as a 
random variable X) is above a given threshold and “negative” 
otherwise. A given prediction can fall under one of the four cate- 
gories for a binary classification task, namely: 


1. True positive (TP): Proteins interact, and the model correctly
infers them as interacting partners.


2. True negative (TN): Proteins do not interact, and the model
correctly infers them as noninteracting.


3. False positive (FP): Proteins do not interact, but the model
incorrectly infers them as interacting.


4. False negative (FN): Proteins interact, but the model incorrectly
infers them as noninteracting.
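The thresholding rule and the four outcome categories can be sketched as follows. This is a minimal illustration; the scores, labels, and the 0.5 threshold are hypothetical values.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, TN, FP, and FN for a binary interaction classifier.

    scores: predicted scores, one per protein pair
    labels: True if the pair truly interacts, False otherwise
    A pair is predicted "positive" when its score is above the threshold.
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for score, interacts in zip(scores, labels):
        predicted_positive = score > threshold
        if predicted_positive and interacts:
            counts["TP"] += 1
        elif not predicted_positive and not interacts:
            counts["TN"] += 1
        elif predicted_positive and not interacts:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts


# Hypothetical scores for four protein pairs.
scores = [0.9, 0.2, 0.7, 0.4]
labels = [True, False, False, True]
counts = confusion_counts(scores, labels, threshold=0.5)
```

Moving the threshold trades false positives against false negatives, which is why the threshold is usually chosen with the cost of each error type in mind.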


The datasets for model training, validation, and testing can be 
prepared using techniques such as randomization, cross-validation, 
and bootstrapping. Given a large enough dataset, it can be divided 
randomly into “k” equal parts. Each of these “k” parts can then be 
randomly used as training and testing sets. This randomization 
ensures the random sampling of training and testing sets that is 
vital for averting any selection bias during the training process. 
However, datasets large enough for proper randomization are rarely
available. In such cases, the same dataset is repeatedly split into
training and testing sets in different ways by a technique called
cross-validation or rotation estimation. Cross-validation can be
exhaustive, as in leave-p-out cross-validation (LpOCV), where p
observations are set aside as the test set and the remaining
observations are taken as the training set, or its special case,
leave-one-out cross-validation (LOOCV),




2.4.2 Decision Tree- 
Based Method 


2.4.3 Probabilistic/ 
Bayesian Classification 


where p = 1. On the other hand, k-fold cross-validation is the most
common form of non-exhaustive cross-validation, where the original
sample is randomly partitioned into k equal-sized subsamples, from
which a single subsample is retained as the test set and the
remaining subsamples are used as the training set. The process is
then repeated k times, with each of the k subsamples used exactly
once as the test set. A common example is fivefold
cross-validation, where the dataset is divided into five subsets,
of which four are used in training the model and the remaining one
is used for testing it, and the process is repeated five times,
using a different subset for testing in each iteration.
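The k-fold procedure can be sketched as follows. This is a minimal illustration that works on sample indices only, assuming a fixed random seed for reproducibility; the function name is hypothetical, and established libraries provide equivalent utilities.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds.

    The indices are shuffled once, dealt into k near-equal folds, and
    each fold serves exactly once as the test set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Fivefold cross-validation on ten samples: four folds train, one tests.
splits = list(k_fold_splits(n_samples=10, k=5))
```

Setting k equal to the number of samples reproduces leave-one-out cross-validation, since every test fold then contains a single observation.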


Decision tree algorithms involve recursive partitioning of the input 
space by selecting the best attribute and expanding the leaf nodes of 
the tree until a predefined stopping criterion is attained. For 
instance, a simple criterion could be the minimum number of
training instances assigned to each leaf node of the tree. The best test
condition for splitting is determined by different algorithms using 
different metrics such as Gini impurity and information gain. Gini 
impurity is a measure of misclassification that denotes the probabil- 
ity of a randomly chosen element from the set being incorrectly 
labeled according to the distribution of labels in the subset. 
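Gini impurity and its use for comparing candidate splits can be sketched as follows. This is a minimal illustration, assuming the standard definition (one minus the sum of squared class proportions) and a size-weighted average over the two child nodes; the labels "p" and "n" stand for interacting and noninteracting pairs and are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum over classes of (class proportion)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


parent = ["p", "p", "n", "n"]
perfect_split = split_impurity(["p", "p"], ["n", "n"])
useless_split = split_impurity(["p", "n"], ["p", "n"])
```

A tree-building algorithm would evaluate every candidate test condition in this way and keep the one that lowers the impurity the most relative to the parent node.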


Biologists have a strong preference for Bayesian-probabilistic clas- 
sifiers due to their diverse functionalities. Moreover, while machine 
learning solutions like SVMs and neural networks are considered 
black-box models due to their limited interpretability, Bayesian- 
probabilistic models are more natural and can handle numerical as 
well as categorical data. Algorithms that use conditional probability
distributions as a way to model relationships among features of
training samples and their class labels are called probabilistic
classifiers. For instance, if the features of input data are denoted
by $x_i$ ($i = 1, \ldots, M$), then the feature vector for each data
point can be represented as $x = [x_1, x_2, \ldots, x_M]$ and the
probability of the data point belonging to each of the $N$ classes
($c = c_1, c_2, \ldots, c_N$) as
$P(C = c_1 \mid x), P(C = c_2 \mid x), \ldots, P(C = c_N \mid x)$.

After modeling the class conditional probabilities, the
probabilistic approach seeks to classify the input data points to
the class with the maximum probability. In case of a binary
classification problem, this translates to computing the ratio
$Y = P(x \mid C = c_1) / P(x \mid C = c_2)$ and then choosing $c_1$
if $Y > 1$, and $c_2$ otherwise, because the decision boundary is
formed by the region of the feature space where $Y = 1$.
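The binary decision rule based on the ratio Y of the two class-conditional likelihoods can be sketched as follows. This is a minimal illustration, assuming a single feature with a Gaussian class-conditional distribution in each class and equal class priors; the means and standard deviations are hypothetical. The ratio is computed in log space, so Y > 1 corresponds to log Y > 0.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log of the Gaussian density with the given mean and std."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def log_likelihood_ratio(x, class1, class2):
    """log Y = log P(x | C = c1) - log P(x | C = c2)."""
    return gaussian_log_pdf(x, *class1) - gaussian_log_pdf(x, *class2)


def classify(x, class1, class2):
    """Choose c1 when Y > 1 (log Y > 0), and c2 otherwise."""
    return "c1" if log_likelihood_ratio(x, class1, class2) > 0 else "c2"


# Hypothetical (mean, std) of a coexpression score in each class.
INTERACTING = (0.8, 0.1)       # class c1
NON_INTERACTING = (0.3, 0.1)   # class c2

label_high = classify(0.7, INTERACTING, NON_INTERACTING)
label_low = classify(0.4, INTERACTING, NON_INTERACTING)
```

With several independent features, the per-feature log ratios simply add up, which is the basis of log-odds scoring schemes.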

Probabilistic methods based on log-odds scoring schemes have 
been widely used in PPI prediction, as well as for filtering high- 
throughput experimental datasets that can potentially include sev- 
eral FPs. Genomic features such as coexpression values, essentiality 
and co-localization, structural features, and sequence signatures 
have been used under Bayesian-probabilistic frameworks for PPI 
prediction [134]. 


2.4.4 Artificial Neural 
Networks 


2.4.5 Clustering 


2.5 Challenges and 
Limitations 


2.5.1 Reliability of 
Protein Interaction Data 




The artificial neural network (ANN) is one of the oldest machine
learning algorithms used to perform nonlinear statistical modeling
and develop binary classification models, and it has since evolved
into state-of-the-art deep learning architectures such as recurrent
neural networks and stacked autoencoders, among others. Due to their
huge model capacity, ANNs generally require less statistical training 
and are able to implicitly detect complex nonlinear relationships 
between dependent and independent variables and detect all possi- 
ble interactions between predictor variables. 

ANNs are made up of a network of connections, where each of the
individual elements transfers information to and from
upstream/downstream neurons. Each connection is assigned a trainable
parameter called a weight ($w_{ij}$). The propagation function
$p_j(t) = \sum_i o_i(t)\, w_{ij}$ computes the input $p_j(t)$ to the
neuron $j$ from the outputs $o_i(t)$ of predecessor neurons.
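The propagation function, in its usual weighted-sum form $p_j(t) = \sum_i o_i(t)\, w_{ij}$, can be sketched as follows. This is a minimal illustration; the sigmoid activation applied afterward is an assumption added for the example, and the numerical values are hypothetical.

```python
import math


def propagate(predecessor_outputs, weights_to_j):
    """Input to neuron j: the weighted sum of predecessor outputs."""
    return sum(o * w for o, w in zip(predecessor_outputs, weights_to_j))


def sigmoid(value):
    """A common activation function (an assumption in this sketch)."""
    return 1.0 / (1.0 + math.exp(-value))


# Outputs o_i of three predecessor neurons and their weights w_ij.
outputs = [1.0, 0.5, -1.0]
weights = [0.2, -0.4, 0.1]

net_input = propagate(outputs, weights)   # p_j
activation = sigmoid(net_input)           # output o_j of neuron j
```

Training adjusts the weights so that the outputs of the final layer match the expected labels; the propagation step itself stays the same in every layer.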


Clustering is the primary unsupervised machine learning technique
for classification problems; it tries to segregate data points into
groups such that data points placed in the same group are more
similar to each other than to those in other groups. Clustering is
useful in exploratory pattern analysis, pattern classification, and 
decision making and for outlier detection. Also, clustering is pri- 
marily used in the cases when the class labels are not known in 
advance. The main advantage of clustering is its ability to determine 
the intrinsic classification within a set of unlabeled data, hence not 
requiring a separate training stage. 

The different distance metrics used by clustering algorithms 
include (a) Euclidean distance metric, (b) Euclidean squared dis- 
tance metric, (c) Manhattan (city block) distance, (d) Chebyshev 
distance, (e) Pearson’s correlation coefficient, (f) squared Pearson’s 
correlation coefficient, and (g) Spearman’s rank correlation
coefficient. A general schematic showing the elements of a simple
neural network is given in Fig. 5.
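The geometric distance metrics listed above can be sketched as follows. This is a minimal illustration covering metrics (a) to (d); the correlation-based measures (e) to (g) are omitted for brevity, and the example vectors are hypothetical.

```python
import math


def squared_euclidean(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(squared_euclidean(a, b))


def manhattan(a, b):
    """City block distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


point_a = [0.0, 0.0]
point_b = [3.0, 4.0]
distances = {
    "euclidean": euclidean(point_a, point_b),
    "squared_euclidean": squared_euclidean(point_a, point_b),
    "manhattan": manhattan(point_a, point_b),
    "chebyshev": chebyshev(point_a, point_b),
}
```

The same pair of points is 5, 25, 7, or 4 units apart depending on the metric, so the choice of metric can change which points a clustering algorithm groups together.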


The rapid development of tools for experimentally identifying pro- 
tein interactions has been accompanied by computational methods 
for the analysis of experimental data as well as the prediction of 
novel interactions. Despite the remarkable progress in the develop- 
ment of technical and analytical tools for the identification and 
prediction of protein interaction networks, obstacles remain — 
some inherent to the field and some unique to each approach. 
Some difficulties encountered while implementing DL methods
are also listed.


Many experimental studies have reported multiple protein interac- 
tion networks, and with high-throughput studies comes the inevi- 
table problem of noise. Limited by various factors, studies also yield 
numerous false negatives like transient or cell stage-specific protein 




2.5.2 Data Integration 


2.5.3 Dynamic Protein 
Network Construction 


2.5.4 Evaluation of 
Protein Interaction 
Networks 


[Figure 5: schematic of a neural network with an input layer,
hidden layers, and an output layer]


Fig. 5 Overview of a simple neural network architecture. Data, in the form of a
feature matrix, is fed to the network, followed by nonlinear transformations in the
hidden layers. The parameters of these hidden layers are heuristically optimized
to produce the required output


interactions that may not be detected. Hence, filtering out noisy 
data before analysis and integration is essential. Moreover, reducing 
the false negatives arising out of noisy and incomplete interaction 
data is another challenge that needs to be tackled. 


The reliable analysis of the protein interaction network centrality, 
modularity, and dynamics is hindered by input data noise. A more 
comprehensive analysis could be obtained by integrating data from 
multiple biological sources, such as RNA-Seq data, protein domain 
information, cellular localization, etc. However, effective integra- 
tion of data from multiple biological sources is a research area with 
many gaps. 


The inherent transient properties of protein expression and inter- 
action have shifted the focus from understanding static to dynamic 
networks. Recently, the integration of time series data of expression 
and static interactions has been proposed. Yet, the accuracy of 
expression and the number of time points needed are limiting 
factors for the applicability of such methods. High-sensitivity and 
high-throughput technologies are slowly replacing traditional 
time-intensive methods, yet spatial and temporal analysis remains 
hindered by the tissue and cell location-specific processes. Hence, 
combining multiple data sources across various cell-specific and 
temporal analyses will emerge as a hotspot in the field of computa- 
tional biology. 


The computational analysis and prediction of protein interaction 
networks depend on the reliability of the experimental data. Hence, 
there is a need to evaluate the quality of the interaction networks in 
a manner that is not sensitive to the experimental conditions. 


2.5.5 Lack of Data 


2.5.6  Overfitting 


2.5.7 Data Imbalance 




Deep learning methods are data-hungry. A lot of data is required to 
develop a robust and accurate deep learning algorithm [135]. In 
certain biological cases, available data may not be enough. Data 
could be collected from similar tasks and transfer learning applied 
to design a better mapping function [136]. However, it is unknown 
if this approach would allow a sufficient representation of the 
original data. Modifying well-trained models from similar tasks to 
fit the available data is a viable alternative [137]. The available
data can also be augmented: especially in the case of image data,
rotation and mirroring do not generally change the labels of the
data. Nevertheless, in the case of biological sequence or
structure, such approaches should be implemented
with care. Simulated data may also be used to add to the available 
data, provided the physical processes are well-understood and the 
simulators yield reliable samples [138]. 


DL models are generally high complexity models that deal with a 
large number of parameters, which is the reason such algorithms are 
at risk of overfitting, i.e., performing well on the training data but
failing to generalize to the test data [139]. Although some recent
studies suggest that the implicit bias of the training process deals 
with the issue of overfitting, certain cases demand specialized tech- 
niques to make models robust [135, 140, 141]. Various algorithms 
have been developed in the past few years to induce generalizability 
and can mainly be classified into three categories. The first type, 
based on model parameters and architecture, includes dropout, 
batch normalization, and weight decay [139, 142, 143]. The sec- 
ond type acts on the inputs — data augmentation and data corrup- 
tion techniques [144]. A third type acts on the model output by 
penalizing overconfident predictions [145]. A detailed discussion of
the types of regularization and their benefits can be found in a
recent review [146].


Biological data is generally imbalanced, with positive samples
outnumbered manyfold by negative samples [147]. Training a
model on imbalanced data introduces a bias toward the majority 
class, but reducing the amount of data used to train the model also 
reduces the information that can be extracted from the whole 
dataset. Proper performance measurement criteria should be used 
to evaluate the predictions to overcome the challenge of imbalance, 
measuring performance on both classes [148, 149]. Another 
method involves modifying the loss function to penalize the 
model if it underperforms on the minority class. The input data 
could also be under- or up-sampled when training the predictor. 
Where the data can be arranged hierarchically, different models can 
be built for each level [150]. 
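Modifying the loss function to penalize errors on the minority class can be sketched as follows. This is a minimal illustration, assuming a binary cross-entropy loss and a common inverse-frequency weighting heuristic, n / (2 * n_class); the weighting scheme, labels, and predicted probabilities are assumptions made for the example.

```python
import math


def class_weights(labels):
    """Inverse-frequency weights for binary labels (1 = positive)."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    return {1: n / (2 * n_pos), 0: n / (2 * n_neg)}


def weighted_bce(probs, labels, weights, eps=1e-12):
    """Mean binary cross-entropy with a per-class weight on each sample."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        loss = -math.log(p) if y == 1 else -math.log(1.0 - p)
        total += weights[y] * loss
    return total / len(labels)


# One interacting pair among four: the positive class is the minority.
labels = [1, 0, 0, 0]
weights = class_weights(labels)

# Same confidence error made once on the positive, once on a negative.
loss_missed_positive = weighted_bce([0.1, 0.1, 0.1, 0.1], labels, weights)
loss_missed_negative = weighted_bce([0.9, 0.9, 0.1, 0.1], labels, weights)
```

With these weights, misclassifying the single positive sample costs more than misclassifying one of the negatives, which counteracts the bias toward the majority class without discarding any data.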




2.5.8 Lack of 
Interpretability 


2.5.9 Uncertainty Scaling 


2.5.10 Catastrophic 
Forgetting 


2.6 Standard 
Modeling Protocol 


2.6.1 Computing 
Resources 


Deep learning methods have been criticized for being black-box
models. Recent times have seen an increasing effort toward
improving the interpretability of these models. Especially in biology, it is
important that the prediction model used is interpretable to allow 
an understanding of the features — motifs, sequences, or structures — 
which may be important for the process being studied. Many algo- 
rithms that derive feature importance from deep learning models 
assign example-specific importance scores. Reliable scores may be 
achieved by employing perturbation-based or backpropagation- 
based approaches. Perturbation-based methods alter parts of the 
input and measure the effect on the model output [151- 
154]. Backpropagation-based methods allow a signal from the 
output layer to be sent backward to the input layer to determine 
the importance of the input [155, 156]. Although such methods 
have been shown to be useful in multiple cases, they are still under 
active development. 
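
The perturbation-based idea can be sketched as an occlusion scan: mask 
one residue at a time and record how much the model output drops. The 
sketch below is illustrative only; toy_model is a hypothetical stand-in 
for a trained predictor, and the mask character "X" and the RGD motif 
are assumptions chosen for the example.

```python
def toy_model(seq):
    """Hypothetical stand-in for a trained predictor.

    Returns 1.0 if the motif RGD occurs in the sequence, else 0.0.
    """
    return 1.0 if "RGD" in seq else 0.0

def occlusion_importance(model, seq, mask="X"):
    """Score each position by the drop in output when it is masked."""
    baseline = model(seq)
    scores = []
    for i in range(len(seq)):
        perturbed = seq[:i] + mask + seq[i + 1:]
        scores.append(baseline - model(perturbed))
    return scores

# Positions 2-4 (R, G, D) are the only ones whose masking changes the output.
scores = occlusion_importance(toy_model, "AKRGDLT")
```

With a real model the same loop would call the trained predictor; 
masking windows of several residues instead of single positions is a 
common variation.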


Machine learning models not only make predictions but can also give 
a confidence score for each query [157]. The confidence score 
informs users of the reliability of a prediction; in biological 
problems, it helps prevent building on misleading and unreliable 
model outcomes. For this purpose, the scores must be scaled so that 
they reflect the actual risk in the given context. Probability 
scores from the softmax function are usually overconfident and 
hence not on the right scale [145]. Obtaining reliable outputs 
therefore involves post hoc scaling of the softmax outputs. 
Examples of scaling methods include Platt scaling, Bayesian 
binning, and histogram binning [157-159]. Temperature scaling, 
proposed more recently, was reported to give much better results 
than the other methods [160]. 
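
Temperature scaling divides the logits by a single scalar T before the 
softmax and chooses T to minimize the negative log-likelihood on 
held-out data. The sketch below fits T by a simple search over 
candidate values; the candidate grid, the function names, and the toy 
validation logits are assumptions made for illustration, and a 
gradient-based optimizer would normally replace the search.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the candidate temperature with the lowest validation NLL."""
    if candidates is None:
        candidates = [0.5 + 0.1 * k for k in range(46)]  # 0.5 to 5.0
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))

# Toy validation set: four confident predictions, the last one is wrong.
val_logits = [[4.0, 0.0], [3.5, 0.0], [0.0, 4.2], [3.8, 0.0]]
val_labels = [0, 0, 1, 1]
T = fit_temperature(val_logits, val_labels)
calibrated = softmax(val_logits[0], T)
```

Because all logits are divided by the same positive constant, the 
predicted class is unchanged; only the reported confidence is softened.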


Catastrophic forgetting occurs when a deep learning model is unable 
to learn and remember different tasks that may not be explicitly 
labeled, may switch unpredictably, or may occur sequentially [161]. 
This is common in biology, where data is continually accumulating 
and changing. Training new models from scratch after incorporating 
the new data is a plausible solution, but it is computationally 
intensive and time-consuming. At present, three types of methods 
are employed to deal with catastrophic forgetting: 
regularization-based methods, methods that use dynamic neural 
network architectures and rehearsal training, and methods based on 
dual-memory learning systems [161-164]. 
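
As a rough illustration of rehearsal training, the sketch below keeps a 
small fixed-size memory of previously seen examples (filled by 
reservoir sampling) and mixes a few of them into every new training 
batch. The class name, capacity, and batch composition are assumptions 
made for this example; the cited methods differ in how the memory is 
built and used.

```python
import random

class RehearsalBuffer:
    """Fixed-size memory of past examples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        """Keep each example seen so far with equal probability."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def mixed_batch(self, new_examples, n_replay):
        """Return the new examples plus up to n_replay stored old ones."""
        k = min(n_replay, len(self.items))
        return list(new_examples) + self.rng.sample(self.items, k)

# Store examples from an earlier task, then build a batch for a new task.
buffer = RehearsalBuffer(capacity=5)
for i in range(100):
    buffer.add(("old_task", i))
batch = buffer.mixed_batch([("new_task", 0), ("new_task", 1)], n_replay=3)
```

Training on such mixed batches exposes the model to old and new data 
together, without retraining from scratch on the full dataset.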


Most ML workflows can be implemented on a standard Unix 
workstation in a standard configuration. The workstation can also 
be equipped with a graphics processing unit (GPU) to train deep 
learning models. The exact specifications of the machine vary with 
the size of the dataset and the model architecture. In addition to 
a CUDA-capable GPU and suitable drivers, CUDA 
(https://developer.nvidia.com/cuda-toolkit), the underlying 
parallel computing platform, must be installed separately for 
training deep learning models. Additionally, as current deep 
learning frameworks like TensorFlow and PyTorch are used through 
the Python programming language, the user should have a certain 
level of familiarity with the language. 


Software Installations 


Machine Learning Frameworks 


2.6.2 Data Processing, Model Building, and Evaluation 


Prediction of Protein Interactions 313 


It is generally recommended that all the required packages be 
installed in a virtual environment. This can be easily managed with 
an environment manager such as Conda 
(https://docs.conda.io/en/latest/). 


As mentioned earlier, multiple machine learning frameworks are 
available with active development and extensive community support. 
scikit-learn (https://scikit-learn.org/stable/), TensorFlow 
(https://tensorflow.org), Theano 
(http://deeplearning.net/software/theano/), and PyTorch 
(https://pytorch.org) are some of the commonly used machine 
learning and deep learning frameworks. 
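
Before starting a workflow, it can help to confirm which frameworks 
are importable in the active environment. The sketch below uses only 
the standard library and does not import the frameworks themselves; 
the helper name is hypothetical, and the import names checked 
(sklearn, tensorflow, torch) are the module names under which 
scikit-learn, TensorFlow, and PyTorch are imported.

```python
import importlib.util

def available_frameworks(names=("sklearn", "tensorflow", "torch")):
    """Report which top-level modules can be found in this environment."""
    return {name: importlib.util.find_spec(name) is not None
            for name in names}

if __name__ == "__main__":
    for name, found in available_frameworks().items():
        print(f"{name}: {'installed' if found else 'not found'}")
```

A framework reported as not found would need to be installed into the 
active virtual environment before use.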


As described earlier, there are multiple methods and resources for 
processing protein-protein interaction data. For instance, PROFEAT 
(http://bidd.group/cgi-bin/profeat2016/ligand/profnew.cgi) is a 
commonly used web server for calculating physicochemical and 
structural features from a given protein sequence. Similarly, Propy 
(https://pypi.org/project/propy3/1.0.0a2/) and iFeature are Python 
packages that can compute a large number of sequence features 
(amino acid compositions, dipeptide compositions, Moran 
autocorrelation descriptors, Geary autocorrelation descriptors, and 
composition, transition, and distribution descriptors, among 
others). The steps of an entire ML-/DL-based PPI prediction 
workflow are summarized in Fig. 6. 
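
To make the simplest of these descriptors concrete, the sketch below 
computes amino acid composition (20 values) and dipeptide composition 
(400 values) directly from a sequence string. It is a plain-Python 
illustration and does not use the Propy or iFeature APIs; the function 
names are assumptions, and residues outside the 20 standard amino 
acids are counted in the sequence length but otherwise ignored.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return {aa: seq.count(aa) / len(seq) for aa in AMINO_ACIDS}

def dipeptide_composition(seq):
    """Fraction of each of the 400 ordered residue pairs in the sequence."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not pairs:
        raise ValueError("sequence must contain at least two residues")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for pair in pairs:
        if pair in counts:
            counts[pair] += 1
    return {pair: n / len(pairs) for pair, n in counts.items()}

aac = amino_acid_composition("AAC")
dpc = dipeptide_composition("AAC")
```

For a pair of proteins, the two feature vectors are typically 
concatenated before being passed to a classifier.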

Annotated protein-protein interaction data is retrieved from 
openly available databases or novel experimental sources and 
processed. The processed data is then divided into training, 
validation, and test sets, generally in 80/10/10 proportions, 
although this ratio can be altered depending on the size of the 
dataset. Assuming that the entire dataset follows the same 
underlying distribution, the split can be performed randomly. 
Cross-validation techniques are additionally employed so that 
results from a single, possibly unrepresentative split do not 
mislead. 
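
The random 80/10/10 split and a k-fold cross-validation scheme can be 
sketched as follows. This is a minimal standard-library illustration; 
the function names, the fixed seed, and k = 5 are assumptions made for 
the example.

```python
import random

def train_val_test_split(items, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the items and cut them into training, validation, and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(fractions[0] * len(items))
    n_val = round(fractions[1] * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

train, val, test = train_val_test_split(range(100))
folds = list(k_fold_indices(10, k=5))
```

Every sample is held out exactly once across the k folds, so the 
averaged score does not depend on one particular split.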

Data and labels must be appropriately scaled before training so 
that features with disproportionate scales are standardized; this 
prevents the accumulation of large weights and the formation of 
skewed gradients during training. Generally, a standard scaler 
(zero mean, unit variance) is employed for this purpose. 
Alternatively, min-max scaling or log transformations are 
employed, depending on the use case. During training, 
hyperparameter optimization plays a critical role in the model's 
overall performance. The set of hyperparameters forms a 




[Figure 6: workflow diagram. Recoverable labels: 1. Feature 
extraction; 2. Feature engineering; sequence based (sequence 
homology, motif/domain based, correlated mutation); structure based 
(structural homology, 3D characteristics, depth/protrusion, solvent 
accessibility, protein docking, surface patches); physiochemical 
properties; support vector machines; deep learning (feed-forward 
neural networks, multilayer perceptron, recurrent neural nets, 
convolutional neural network, geometric deep learning, 
autoencoders, 3D CNNs)] 

Fig. 6 Overview of a standard machine learning or deep learning 
implementation. Data retrieval, feature extraction, feature 
engineering, and model optimization are some of the major elements 
of any ML/DL workflow 


2.7 Perspectives 


space that is tuned using the training data to minimize the error. 
Architectures such as CNNs have additional hyperparameters, like 
filter size and stride length, that need to be optimized 
separately, for example by grid search. 
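
Grid search simply evaluates every combination of candidate values and 
keeps the best one. In the sketch below, toy_validation_error is a 
hypothetical stand-in for training a CNN with the given filter size 
and stride and returning its error; the candidate values and function 
names are likewise assumptions made for illustration.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Try every combination in the grid; return the lowest-error setting."""
    names = sorted(grid)
    best_params, best_error = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        error = evaluate(**params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error

def toy_validation_error(filter_size, stride):
    """Hypothetical stand-in for 'train the model, return its error'."""
    return (filter_size - 5) ** 2 + (stride - 2) ** 2

grid = {"filter_size": [3, 5, 7], "stride": [1, 2, 3]}
best_params, best_error = grid_search(toy_validation_error, grid)
```

The number of evaluations grows multiplicatively with each added 
hyperparameter, which is why the grid is usually kept coarse.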

Although most machine learning and deep learning methods are 
considered black-box predictors with very little interpretability, 
attention mechanisms and utilities such as saliency analysis and 
Grad-CAM make it possible to deduce the relative importance of 
individual features in the data. 


With the advancement of high-throughput techniques in biology, 
deep learning (DL) has rapidly become a widely used and powerful 
technique. Judging by the current popularity of DL in biology, with 
applications ranging from inferring protein localization in a cell 
to predicting binding events, it can be safely assumed that DL will 
dominate the field for the next few years [165-167]. In addition to 
proteomics, many other domains have progressed toward the omics 
scale: transcriptomics, genomics, and lipidomics, to name a few. 
Inferred biological networks are usually incomplete, especially 
when derived from a single-omics study. Integrating multi-omics 
approaches may help decipher biological complexity across the 
different layers with improved descriptive capability. Multiple 
tools and methods have been designed for the analysis and 
integration of multi-omics data, and they offer numerous 
applications in protein interaction studies as well as in many 
other fields [168]. 




Natural language processing models are increasingly applied in 
bioinformatics to simplify DNA and protein sequence-based 
problems. With amino acid and nucleotide sequences interpreted as 
meaningful sentences, language models allow the determination of 
structure, function, and their interrelationships. However, a 
fundamental problem in such applications is how to define 
biological "words." Recently developed techniques that address this 
problem include byte pair encoding (BPE) [169] and the unigram 
language model (ULM) [170]. Another tool, SentencePiece, developed 
to generate words directly, integrates both the BPE and ULM 
algorithms [171]. Also, as chemical structures of biomolecules 
become increasingly available, the development of quantum 
mechanics and quantum machine learning techniques may 
significantly affect the future of computational biology [172-174]. 
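
The way byte pair encoding builds biological "words" can be sketched 
on protein sequences: start from single residues and repeatedly merge 
the most frequent adjacent pair of tokens. This is a simplified 
illustration, not the SentencePiece implementation; the function 
names, the tie-breaking rule, and the toy sequences are assumptions 
made for the example.

```python
from collections import Counter

def merge_pair(tokens, a, b):
    """Replace every adjacent occurrence of (a, b) with the merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def learn_bpe(sequences, num_merges):
    """Learn up to num_merges merges; return them and the tokenized corpus."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Most frequent pair; ties are broken by comparing the pair itself.
        (a, b), count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if count < 2:
            break
        merges.append((a, b))
        corpus = [merge_pair(tokens, a, b) for tokens in corpus]
    return merges, corpus

merges, tokenized = learn_bpe(["MKRGDLRGD", "AKRGDT"], num_merges=2)
```

After two merges the recurring tripeptide RGD has become a single 
token, which is the sense in which frequent subsequences turn into 
"words."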

Advances in computational methods to infer protein interactions 
accelerate the discovery of the molecular mechanisms of cellular 
pathways and diseases and subsequently promote drug discovery and 
novel diagnostic methods. This chapter covers traditional and 
recent methods for protein interaction network discovery and 
prediction, briefly mentioning the challenges associated with 
them. With proteomics research entering the stage of 
multidisciplinary integration, deep learning techniques will 
continue to be integrated into computational proteomics. 


References 


1. Li J, You Z, Li X et al (2017) PSPEL: in silico prediction of self-interacting proteins from amino acids sequences using ensemble learning. IEEE/ACM Trans Comput Biol Bioinform 14(5):1165-1172
2. Huang Y-A, You Z-H, Chen X et al (2016) Improved protein-protein interactions prediction via weighted sparse representation model combining continuous wavelet descriptor and PseAA composition. BMC Syst Biol 10(4):120
3. Davis AM, Teague SJ, Kleywegt GJ (2003) Application and limitations of X-ray crystallographic data in structure-based ligand and drug design. (1433-7851 (Print))
4. Gavin A-C, Bösche M, Krause R et al (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415(6868):141-147
5. Zhu H, Bilgin M, Bangham R et al (2001) Global analysis of protein activities using proteome chips. Science 293(5537):2101-2105
6. Ho Y, Gruhler A, Heilbut A et al (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415(6868):180-183
7. Keskin O, Tuncbag N, Gursoy A (2016) Predicting protein-protein interactions from the molecular to the proteome level. Chem Rev 116(8):4884-4909
8. Venkatesan K, Rual J-F, Vazquez A et al (2009) An empirical framework for binary interactome mapping. Nat Methods 6(1):83-90
9. Luck K, Kim D-K, Lambourne L et al (2020) A reference map of the human binary protein interactome. Nature 580(7803):402-408
10. Walhout AJM, Boulton SJ, Vidal M (2000) Yeast two-hybrid systems and protein interaction mapping projects for yeast and worm. Yeast 17(2):88-94
11. Rain J-C, Selig L, De Reuse H et al (2001) The protein-protein interaction map of Helicobacter pylori. Nature 409(6817):211-215
12. Alonso-Lopez D, Gutiérrez MA, Lopes KP et al (2016) APID interactomes: providing proteome-based interactomes with controlled quality for multiple species and derived networks. Nucleic Acids Res 44(W1):W529-W535
13. Bock JR, Gough DA (2001) Predicting protein-protein interactions from primary structure. Bioinformatics 17(5):455-460
14. Sprinzak E, Margalit H (2001) Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol 311(4):681-692
15. Zhou HX, Shan Y (2001) Prediction of protein interaction sites from sequence profile and residue neighbor list. Proteins: Struct Funct Bioinform 44(3):336-343
16. Sanger F (1952) The arrangement of amino acids in proteins. Adv Protein Chem 7:1-67
17. Anfinsen CB (1973) Principles that govern the folding of protein chains. Science 181(4096):223-230
18. Shen J, Zhang J, Luo X et al (2007) Predicting protein-protein interactions based only on sequences information. Proc Natl Acad Sci 104(11):4337-4341
19. Sharma A, Lyons J, Dehzangi A et al (2013) A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition. J Theor Biol 320:41-46
20. Dong Y, Kuang Q, Dai X et al (2015) Improving the understanding of pathogenesis of human papillomavirus 16 via mapping protein-protein interaction network. Biomed Res Int 2015:890381. https://doi.org/10.1155/2015/890381. Epub 2015 Apr 15. PMID: 25961044; PMCID: PMC4414230
21. Chen M, Ju CJT, Zhou G et al (2019) Multifaceted protein-protein interaction prediction based on Siamese residual RCNN. Bioinformatics 35(14):i305-i314
22. Wang Y-B, You Z-H, Li X et al (2017) Predicting protein-protein interactions from protein sequences by a stacked sparse autoencoder deep neural network. Mol BioSyst 13(7):1336-1344
23. Rifaioglu AS, Cetin Atalay R, Cansen Kahraman D et al (2021) MDeePred: novel multi-channel protein featurization for deep learning-based binding affinity prediction in drug discovery. Bioinformatics 37(5):693-704. https://doi.org/10.1093/bioinformatics/btaa858. PMID: 33067636
24. Altschul SF, Madden TL, Schaffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389-3402
25. Chou K-C, Shen H-B (2007) MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochem Biophys Res Commun 360(2):339-345
26. Cheol Jeong J, Lin X, Chen X-W (2010) On position-specific scoring matrix for protein function prediction. IEEE/ACM Trans Comput Biol Bioinform 8(2):308-315
27. Mousavian Z, Khakabimamaghani S, Kavousi K et al (2016) Drug-target interaction prediction from PSSM based evolutionary information. J Pharmacol Toxicol Methods 78:42-51
28. Zahiri J, Mohammad-Noori M, Ebrahimpour R et al (2014) LocFuse: human protein-protein interaction prediction via classifier fusion using protein localization information. Genomics 104(6, Part B):496-503
29. Li Z-W, You Z-H, Chen X et al (2016) Highly accurate prediction of protein-protein interactions via incorporating evolutionary information and physicochemical characteristics. Int J Mol Sci 17(9):1396
30. Li Z-W, You Z-H, Chen X et al (2017) Accurate prediction of protein-protein interactions by integrating potential evolutionary information embedded in PSSM profile and discriminative vector machine classifier. Oncotarget 8(14):23638-23649
31. Li Y, Li L-P, Wang L et al (2019) An ensemble classifier to predict protein-protein interactions by combining PSSM-based evolutionary information with local binary pattern model. Int J Mol Sci 20(14):3511
32. Yu J, Guo M, Needham CJ et al (2010) Simple sequence-based kernels do not predict protein-protein interactions. Bioinformatics 26(20):2610-2614
33. Dyer MD, Murali TM, Sobral BW (2007) Computational prediction of host-pathogen protein-protein interactions. Bioinformatics 23(13):i159-i166
34. Dyer MD, Murali TM, Sobral BW (2011) Supervised learning and prediction of physical interactions between human and HIV proteins. Infect Genet Evol 11(5):917-923
35. Sanchez IE, Beltrao P, Stricher F et al (2008) Genome-wide prediction of SH2 domain targets using structural information and the FoldX algorithm. PLoS Comput Biol 4(4):e1000052


36. Fernandez-Ballester G, Beltrao P, Gonzalez JM et al (2009) Structure-based prediction of the Saccharomyces cerevisiae SH3-ligand interactions. J Mol Biol 388(4):902-916
37. Hui S, Xing X, Bader GD (2013) Predicting PDZ domain mediated protein interactions from structure. BMC Bioinform 14(1):1-17
38. Richardson JS (1994) Introduction: protein motifs. FASEB J 8(15):1237-1239
39. Kadaveru K, Vyas J, Schiller MR (2008) Viral infection and human disease-insights from minimotifs. Front Biosci 13:6455
40. Neduva V, Russell RB (2006) Peptides mediating interaction networks: new leads at last. Curr Opin Biotechnol 17(5):465-471
41. Stein A, Pache RA, Bernado P et al (2009) Dynamic interactions of proteins in complex networks: a more structured view. FEBS J 276(19):5390-5405
42. Evans P, Dampier W, Ungar L et al (2009) Prediction of HIV-1 virus-host protein interactions using virus and host sequence motifs. BMC Med Genet 2(1):1-13
43. Ben-Hur A, Noble WS (2005) Kernel methods for predicting protein-protein interactions. Bioinformatics 21(suppl_1):i38-i46
44. Greenside P, Hillenmeyer M, Kundaje A (2017) Prediction of protein-ligand interactions from paired protein sequence motifs and ligand substructures. Biocomputing 2018. World Scientific 23:20-31
45. Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37(3):297-336
46. Segura-Cabrera A, Garcia-Pérez CA, Guo X et al (2013) A viral-human interactome based on structural motif-domain interactions captures the human infectome. PLoS One 8(8):e71526
47. Via A, Gould CM, Gemünd C et al (2009) A structure filter for the eukaryotic linear motif resource. BMC Bioinform 10(1):1-17
48. Deng L, Zhang QC, Chen Z et al (2014) PredHS: a web server for predicting protein-protein interaction hot spots by using structural neighborhood properties. Nucleic Acids Res 42(W1):W290-W295
49. Petrey D, Chen TS, Deng L et al (2015) Template-based prediction of protein function. Curr Opin Struct Biol 32:33-38
50. Zhang QC, Petrey D, Deng L et al (2012) Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature 490(7421):556-560
51. Garzon JI, Deng L, Murray D et al (2016) A computational interactome and functional annotation for the human proteome. eLife 5:e18715
52. Xia Z, Wu L-Y, Zhou X et al (2010) Semi-supervised drug-protein interaction prediction from heterogeneous biological spaces. BMC Syst Biol 4(2):S6
53. Eslami Manoochehri H, Nourani M (2020) Drug-target interaction prediction using semi-bipartite graph model and deep learning. BMC Bioinform 21(4):248
54. Kovacs IA, Luck K, Spirohn K et al (2019) Network-based prediction of protein interactions. Nat Commun 10(1):1-8
55. Fang Y, Sun M, Dai G et al (2015) The intrinsic geometric structure of protein-protein interaction networks for protein interaction prediction. IEEE/ACM Trans Comput Biol Bioinform 13(1):76-85
56. Clauset A, Moore C, Newman MEJ (2008) Hierarchical structure and the prediction of missing links in networks. Nature 453(7191):98-101
57. Luo X, Ming Z, You Z et al (2015) Improving network topology-based protein interactome mapping via collaborative filtering. Knowl-Based Syst 90:23-32
58. Yu H, Paccanaro A, Trifonov V et al (2006) Predicting interactions in protein networks by completing defective cliques. Bioinformatics 22(7):823-829
59. Chen Y, Varani G (2005) Protein families and RNA recognition. FEBS J 272(9):2088-2097
60. Glisovic T, Bachorik JL, Yong J et al (2008) RNA-binding proteins and post-transcriptional gene regulation. FEBS Lett 582(14):1977-1986
61. Cooper TA, Wan L, Dreyfuss G (2009) RNA and disease. Cell 136(4):777-793
62. Ke A, Doudna JA (2004) Crystallization of RNA and RNA-protein complexes. Methods 34(3):408-414
63. Chatterjee N, Walker GC (2017) Mechanisms of DNA damage, repair, and mutagenesis. Environ Mol Mutag 58(5):235-263
64. Zhang K, Pan X, Yang Y et al (2018) Predicting circRNA-RBP interaction sites using a codon-based encoding and hybrid deep neural networks. bioRxiv:499012
65. You Z-H, Zhou M, Luo X et al (2016) Highly efficient framework for predicting interactions between proteins. IEEE Transact Cybernet 47(3):731-743
66. Pan X, Shen H-B (2018) Learning distributed representations of RNA sequences and its application for predicting RNA-protein binding sites with a convolutional neural network. Neurocomputing 305:51-58
67. Xiao Y, Cai J, Yang Y et al (2018) Prediction of microRNA subcellular localization by using a sequence-to-sequence model. IEEE:1332-1337. https://doi.org/10.1109/ICDM.2018.00181
68. Takenaka T (2001) Classical vs reverse pharmacology in drug discovery. BJU Int 88:7-10
69. Paul A (2019) Translational and reverse pharmacology. Introduction to basics of pharmacology and toxicology. Springer, pp 313-317
70. Ezzat A, Wu M, Li X-L et al (2019) Computational prediction of drug-target interactions using chemogenomic approaches: an empirical survey. Brief Bioinform 20(4):1337-1357
71. Hopkins AL (2009) Predicting promiscuity. Nature 462(7270):167-168
72. Swamidass SJ (2011) Mining small-molecule screens to repurpose drugs. Brief Bioinform 12(4):327-335
73. Pauwels E, Stoven V, Yamanishi Y (2011) Predicting drug side-effect profiles: a chemical fragment-based approach. BMC Bioinform 12(1):1-13
74. Yamanishi Y, Araki M, Gutteridge A et al (2008) Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 24(13):i232-i240
75. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28(1):31-36
76. Krenn M, Häse F, Nigam A et al (2020) Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach Learn: Sci Technol 1(4):045024
77. Daylight Chemical Information Systems I: https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html. (2021). Accessed 22-04-2021
78. Hirohara M, Saito Y, Koda Y et al (2018) Convolutional neural network based on SMILES representation of compounds for detecting chemical motif. BMC Bioinform 19(19):83-94
79. Méndez-Lucio O, Baillif B, Clevert D-A et al (2020) De novo generation of hit-like molecules from gene expression signatures using artificial intelligence. Nat Commun 11(1):1-10
80. Mikolov T, Chen K, Corrado G et al (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:13013781
81. Zhang Y-F, Wang X, Kaushik AC et al (2020) SPVec: a Word2vec-inspired feature representation method for drug-target interaction prediction. Front Chem 7:895
82. Reymond J-L, Van Deursen R, Blum LC et al (2010) Chemical space as a source for new drugs. MedChemComm 1(1):30-38
83. Faulon J-L, Misra M, Martin S et al (2008) Genome scale enzyme-metabolite and drug-target interaction predictions using the signature molecular descriptor. Bioinformatics 24(2):225-233
84. Steffen A, Kogej T, Tyrchan C et al (2009) Comparison of molecular fingerprint methods on the basis of biological profile data. J Chem Inf Model 49(2):338-347
85. Vamathevan J, Clark D, Czodrowski P et al (2019) Applications of machine learning in drug discovery and development. Nat Rev Drug Discov 18(6):463-477
86. Wu Z, Ramsundar B, Feinberg EN et al (2018) MoleculeNet: a benchmark for molecular machine learning. Chem Sci 9(2):513-530
87. Guo Y, Yu L, Wen Z et al (2008) Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res 36(9):3025-3030
88. Cosic I, Hearn MTW (1992) Studies on protein-DNA interactions using the resonant recognition model: application to repressors and transforming proteins. Eur J Biochem 205(2):613-619
89. Wang Y, You Z, Li L et al (2020) A survey of current trends in computational predictions of protein-protein interactions. Front Comp Sci 14(4):1-12
90. Su AI, Wiltshire T, Batalov S et al (2004) A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci 101(16):6062-6067
91. Kissopoulou A, Jonasson J, Lindahl TL et al (2013) Next generation sequencing analysis of human platelet PolyA+ mRNAs and rRNA-depleted total RNA. PLoS One 8(12):e81809
92. Hillier L, Lennon G, Becker M et al (1996) Generation and analysis of 280,000 human expressed sequence tags. Genome Res 6(9):807-828
93. Manyika J, Chui M, Brown B et al (2011) Big data: the next frontier for innovation, competition, and productivity
94. Larrañaga P, Calvo B, Santana R et al (2006) Machine learning in bioinformatics. Brief Bioinform 7(1):86-112
95. Ferrucci D, Brown E, Chu-Carroll J et al (2010) Building Watson: an overview of the DeepQA project. AI Mag 31:59-79


96. Singh A (2020) Deep learning 3D structures. Nat Methods 17(3):249
97. Orchard S, Kerrien S, Abbani S, Aranda B et al (2012) Protein interaction data curation: the International Molecular Exchange (IMEx) consortium. (1548-7105 (Electronic))
98. Lee I, Blom UM, Wang PI, Shim JE et al (2011) Prioritizing candidate disease genes by network-based boosting of genome-wide association data. (1549-5469 (Electronic))
99. Montojo J, Zuberi K, Rodriguez H, Kazi F et al (2010) GeneMANIA Cytoscape plugin: fast gene function predictions on the desktop. (1367-4811 (Electronic))
100. Schmitt T, Ogris C, Sonnhammer ELL (2014) FunCoup 3.0: database of genome-wide functional coupling networks. (1362-4962 (Electronic))
101. Pearson WR (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. (0076-6879 (Print))
102. Bateman A, Coin L, Durbin R et al (2004) The Pfam protein families database. Nucleic Acids Res 32(Database issue):D138-D141
103. Corpet F, Servant F, Gouzy J et al (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res 28(1):267-269
104. Cheng F, Zhao J, Zhao Z (2016) Advances in computational approaches for prioritizing driver mutations and significantly mutated genes in cancer genomes. (1477-4054 (Electronic))
105. Tranchevent LC, Ardeshirdavani A, ElShal S et al (2016) Candidate gene prioritization with Endeavour. (1362-4962 (Electronic))
106. Aytuna AS, Gursoy A, Keskin O (2005) Prediction of protein-protein interactions by combining structure and sequence conservation in protein interfaces. Bioinformatics 21(12):2850-2855
107. Kundrotas PJ, Vakser IA (2010) Accuracy of protein-protein binding sites in high-throughput template-based modeling. PLoS Comput Biol 6(4):e1000727
108. Aloy P, Ceulemans H, Stark A et al (2003) The relationship between sequence and interaction divergence in proteins. J Mol Biol 332(5):989-998
109. Park S-Y, Beel BD, Simon MI et al (2004) In different organisms, the mode of interaction between two signaling proteins is not necessarily conserved. Proc Natl Acad Sci 101(32):11646-11651


110. Bahar I, Jernigan RL (1996) Coordination geometry of nonbonded residues in globular proteins. Fold Des 1(5):357-370
111. Anashkina A, Kuznetsov E, Esipova N et al (2007) Comprehensive statistical analysis of residues interaction specificity at protein-protein interfaces. Proteins: Struct Funct Bioinform 67(4):1060-1077
112. Aloy P, Russell RB (2003) InterPreTS: protein interaction prediction through tertiary structure. Bioinformatics 19(1):161-162
113. Kundrotas PJ, Zhu Z, Vakser IA (2010) GWIDD: genome-wide protein docking database. Nucleic Acids Res 38(suppl_1):D513-D517
114. Talavera D, Laskowski RA, Thornton JM (2009) WSsas: a web service for the annotation of functional residues through structural homologues. Bioinformatics 25(9):1192-1194
115. Shoemaker BA, Panchenko AR, Bryant SH (2006) Finding biologically relevant protein domain interactions: conserved binding mode analysis. Protein Sci 15(2):352-361
116. Davis FP, Braberg H, Shen M-Y et al (2006) Protein complex compositions predicted by structural similarity. Nucleic Acids Res 34(10):2943-2952
117. Bai H, Ma W, Liu S et al (2008) Dynamic property is a key determinant for protein-protein interactions. Proteins: Struct Funct Bioinform 70(4):1323-1331
118. Caffrey DR, Somaroo S, Hughes JD et al (2004) Are protein-protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci 13(1):190-202
119. Keskin O, Tsai CJ, Wolfson H et al (2004) A new, structurally nonredundant, diverse data set of protein-protein interfaces and its implications. Protein Sci 13(4):1043-1055
120. Ogmen U, Keskin O, Aytuna AS et al (2005) PRISM: protein interactions by structural matching. Nucleic Acids Res 33(suppl_2):W331-W336
121. Sinha R, Kundrotas PJ, Vakser IA (2010) Docking by structural similarity at protein-protein interfaces. Proteins: Struct Funct Bioinform 78(15):3235-3241
122. Zhang QC, Petrey D, Norel R et al (2010) Protein interface conservation across structure space. Proc Natl Acad Sci 107(24):10896-10901
123. Jansen R, Greenbaum D, Gerstein M (2002) Relating whole-genome expression data with protein-protein interactions. Genome Res 12(1):37-46
124. Dandekar T, Snel B, Huynen M et al (1998) Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci 23(9):324-328
125. Franceschini A, Szklarczyk D, Frankild S et al (2012) STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 41(D1):D808-D815
126. Muley VY, Ranjan A (2013) Evaluation of physical and functional protein-protein interaction prediction methods for detecting biological pathways. PLoS One 8(1):e54325
127. Chua HN, Sung W-K, Wong L (2006) Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics 22(13):1623-1630
128. Przulj N, Wigle DA, Jurisica I (2004) Functional topology in a network of protein interactions. Bioinformatics 20(3):340-348
129. Goldberg DS, Roth FP (2003) Assessing experimentally derived interactions in a small world. Proc Natl Acad Sci 100(8):4372-4376
130. Phan HTT, Sternberg MJE (2012) PINALOG: a novel approach to align protein interaction networks - implications for complex detection and function prediction. Bioinformatics 28(9):1239-1245
131. Ma C-Y, Liao C-S (2020) A review of protein-protein interaction network alignment: from pathway comparison to global alignment. Comput Struct Biotechnol J 18:2647
132. Sarkar D, Saha S (2019) Machine-learning techniques for the prediction of protein-protein interactions. LID - 104 [pii]. (0973-7138 (Electronic))
133. Zhang M, Su Q, Lu Y et al (2017) Application of machine learning approaches for protein-protein interactions prediction. (1875-6638 (Electronic))
134. Browne F, Wang H, Zheng H, Azuaje F et al (2010) A knowledge-driven probabilistic framework for the prediction of protein-protein interaction networks. (1879-0534 (Electronic))
135. Li Y, Ding L, Gao X (2018) On the decision boundary of deep neural networks. arXiv preprint arXiv:180805385
136. Yosinski J, Clune J, Bengio Y et al (2014) How transferable are features in deep neural networks? arXiv preprint arXiv:14111792
137. Tan C, Sun F, Kong T et al (2018) A survey on deep transfer learning. Springer:270-279
138. Li Y, Xu F, Zhang F et al (2018) DLBI: deep learning guided Bayesian inference for structure reconstruction of super-resolution fluorescence microscopy. Bioinformatics 34(13):i284-i294


139. 


140. 


141 


142 


143 


144. 


145. 


146. 


147 


148. 


149. 


150. 


151 


152 


153. 


154. 


Srivastava N, Hinton G, Krizhevsky A et al 
(2014) Dropout: a simple way to prevent 
neural networks from overfitting. J Mach 
Learn Res 15(1):1929-1958 

Soudry D, Hoffer E, Nacson MS et al (2018) 
The implicit bias of gradient descent on sepa- 
rable data. J Mach Learn Res 19(1): 
2822-2878 


. Zhang C, Bengio S, Hardt M et al (2016) 


Understanding deep learning requires 
rethinking generalization. arXiv preprint 
arXiv:161103530 


. loffe S$, Szegedy C (2015) Batch normaliza- 


tion: Accelerating deep network training by 
reducing internal covariate shift. PMLR, pp 
448-456 


. Krogh A, Hertz JA (1992) A simple weight 


decay can 
950-957 
Maaten L, Chen M, Tyree S et al (2013) 
Learning with marginalized corrupted 
features. PMLR:410-418 

Pereyra G, Tucker G, Chorowski J et al 
(2017) Regularizing neural networks by 
penalizing confident output distributions. 
arXiv preprint arXiv:170106548 

Kukacka J, Golkov V, Cremers D (2017) Reg- 
ularization for deep learning: a taxonomy. 
arXiv preprint arXiv:171010686 


improve generalization, pp 


. Yang P, Zhang Z, Zhou BB et al (2011) Sam- 


ple subset optimization for classifying imbal- 
anced biological data. Springer, pp 333-344 
Wang S, Sun S, Xu J (2015) Auc-maximized 
deep convolutional neural fields for sequence 
labeling. arXiv preprint arXiv:151105265 
Cao C, Chicco D, Hoffman MM (2020) The 
MCC-F1 curve: a performance evaluation 
technique for binary classification. arXiv pre- 
print arXiv:200611278 

Li Y, Wang S, Umarov R et al (2018) 
DEEPre: sequence-based enzyme EC num- 
ber prediction by deep learning. Bioinformat- 
ics 34(5):760-769 


. Alipanahi B, Delong A, Weirauch MT et al 


(2015) Predicting the sequence specificities 
of DNA-and RNA-binding proteins by deep 
learning. Nat Biotechnol 33(8):831-838 


. Ribeiro MT, Singh S, Guestrin C (2016) 


“Why should I trust you?” Explaining the 
predictions of any classifier, pp 1135-1144 
Umarov RK, Solovyev VV (2017) Recogni- 
tion of prokaryotic and eukaryotic promoters 
using convolutional deep learning neural net- 
works. PLoS One 12(2):e0171410 

Zhou J, Troyanskaya OG (2015) Predicting 
effects of noncoding variants with deep lear- 
ning—based sequence model. Nat Methods 
12(10):931-934 


155. 


156. 


157. 


158. 


159. 


160. 


lol. 


162. 


163. 


164. 


165. 


166. 


167. 


168. 


169. 


Shrikumar A, Greenside P, Kundaje A (2017) 
Learning important features through propa- 
gating activation differences. In: Doina P, Yee 
Whye T (eds) Proceedings of the 34th inter- 
national conference on machine learning. 
Proceedings of achine learning research. 
PMLIR, pp 3145-3153 

Samek W, Binder A, Montavon G et al (2016) 
Evaluating the visualization of what a deep 
neural network has learned. IEEE Transac 
Neural Netw Learn Syst 28(11):2660-2673 
Platt J (1999) Probabilistic outputs for sup- 
port vector machines and comparisons to reg- 
ularized likelihood methods. Adv Large 
Margin Classifiers 10(3):61-74 

Naeini MP, Cooper G,  Hauskrecht 
M. Obtaining well calibrated probabilities 
using Bayesian binning. 2015 

Zadrozny B, Elkan C (2001) Obtaining 
calibrated probability estimates from decision 
trees and naive Bayesian classifiers. 
Citeseer:609-616. ICML 2001 

Guo C, Pleiss G, Sun Y et al (2017) On cali- 
bration of modern neural networks. Proceed- 
ings of the 34th International Conference on 
Machine Learning, PMLR 70:1321-1330 
Kirkpatrick J, Pascanu R, Rabinowitz N et al 
(2017) Overcoming catastrophic forgetting 
in neural networks. Proc Natl Acad Sci 
114(13):3521 

Li Y, Li Z, Ding L et al (2018) Supportnet: 
solving catastrophic forgetting in class incre- 
mental learning with support data. arXiv pre- 
print arXiv:180602942 

Rebuffi S-A, Kolesnikov A, Sperl G et al 
(2017) icarl: incremental classifier and repre- 
sentation learning, pp 2001-2010 

Hinton GE, Plaut DC (1987) Using fast 
weights to deblur old memories, pp 177-186 
Almagro Armenteros JJ, Sonderby CK, Son- 
derby SK et al (2017) DeepLoc: prediction of 
protein subcellular localization using deep 
learning. Bioinformatics 33(21):3387-3395 
Sun T, Zhou B, Lai L et al (2017) Sequence- 
based prediction of protein protein interac- 
tion using a deep-learning algorithm. BMC 
Bioinform 18(1):1-8 

Yi H-C, You Z-H, Huang D-S et al (2018) A 
deep learning framework for robust and accu- 
rate prediction of ncRNA-protein interactions 
using evolutionary information. Mol Ther- 
Nucl Acids 11:337-344 

Subramanian I, Verma S, Kumar S et al 
(2020) Multi-omics data integration, inter- 
pretation, and its application. Bioinform Biol 
Insights 14:1177932219899051 

Sennrich R, Haddow B, Birch A (2015) 
Neural machine translation of rare words 


170. 


171 


172. 


173. 


174. 


175. 


176. 


177. 


178. 


179. 


180. 


181. 


182. 


183. 


Prediction of Protein Interactions 321 


with subword units. In Proceedings of the 
54th Annual Meeting of the Association for 
Computational Linguistics (Volume 1: Long 
Papers), pp 1715-1725. Berlin, Germany 
Kudo T Japa. Subword regularization: 
improving neural network translation models 
with multiple subword candidates. 2018 


. Kudo T, Richardson J Japa. SentencePiece: A 


simple and language independent subword 
tokenizer and detokenizer for neural text 
processing. 2018 


Rebentrost P, Mohseni M, Lloyd S (2014) 
Quantum support vector machine for big 
data classification. Phys Rev Lett 113(13): 
130503 

Crawford D, Levit A, Ghadermarzy N et al 
(2016) Reinforcement learning using quan- 
tum Boltzmann machines. arXiv preprint 
arXiv:161205695 

Khrennikov A, Yurova E (2017) Automaton 
model of protein: dynamics of conformational 
and functional states. Prog Biophys Mol Biol 
130:2-14 

Hermjakob H, Montecchi-Palazzi L, Bader G 
et al (2004) The HUPO PSI’s molecular 
interaction format—a community standard 
for the representation of protein interaction 
data. Nat Biotechnol 22(2):177-183 
Narayan P, Orte A, Clarke RW et al (2012) 
The extracellular chaperone cluster in seques- 
ters oligomeric forms of the amyloid-B 1— 
40 peptide. Nat Struct Mol Biol 19(1):79-83 
Heegaard NHH (2009) Affinity in electro- 
phoresis. Electrophoresis 30(S1):S229-SS39 
Rogers KR (2000) Principles of affinity-based 
biosensors. Mol Biotechnol 14(2):109-129 
Wallace BA, Janes RW (2001) Synchrotron 
radiation circular dichroism spectroscopy of 
proteins: secondary structure, fold recogni- 
tion and structural genomics. Curr Opin 
Chem Biol 5(5):567-571 

van Liempd S, Morrison D, Sysmans L et al 
(2011) Development and validation of a 
higher-throughput equilibrium dialysis assay 
for plasma protein binding. JALA: J Assoc 
Lab Autom 16(1):56-67 

Muchowski PJ, Schaffar G, Sittler A et al 
(2000) Hsp70 and hsp40 chaperones can 
inhibit self-assembly of polyglutamine pro- 
teins into amyloid-like fibrils. Proc Natl Acad 
Sci 97(14):7841-7846 

Demirdéven N, Cheatum CM, Chung HS 
et al (2004) Two-dimensional infrared spec- 
troscopy of antiparallel B-sheet secondary 
structure. J Am Chem Soc 126(25): 
7981-7990 

Prakasam AK, Maruthamuthu V, Leckband 
DE (2006) Similarities between heterophilic 


322 


184. 


185. 


186. 


187. 


188. 


189. 


190. 


191. 


192. 


193. 


194. 


195. 


196. 


197. 


Dhvani Sandip Vora et al. 


and homophilic cadherin adhesion. Proc Natl 
Acad Sci 103(42):15434-15439 

Leavitt S, Freire E (2001) Direct measure- 
ment of protein binding energetics by isother- 
mal titration calorimetry. Curr Opin Struct 
Biol 11(5):560-566 

Murphy RM (1997) Static and dynamic light 
scattering of biological macromolecules: what 
can we learn? Curr Opin Biotechnol 8(1): 
25-30 

Hanson BL (2004) Getting protein solvent 
structures down cold. Proc Natl Acad Sci 
101(47):16393-16394 

Pellecchia M, Sem DS, Wiithrich K (2002) 
NMR in drug discovery. Nat Rev Drug Dis- 
cov 1(3):211-219 

Udenfriend S, Gerber LD, Brink L et al 
(1985) Scintillation proximity radioimmuno- 
assay utilizing 125I-labeled ligands. Proc Natl 
Acad Sci 82(24):8672-8676 

Cotruvo JA, Stubbe J (2008) NrdI, a flavo- 
doxin involved in maintenance of the diferric- 
tyrosyl radical cofactor in Escherichia coli class 
Ib ribonucleotide reductase. Proc Natl Acad 
Sci 105(38):14383-14388 

Lowe TM, Eddy SR (1999) A computational 
screen for methylation guide snoRNAs in 
yeast. Science 283(5405):1168-1171 
Modesti M, Ristic D, Van Der Heijden T et al 
(2007) Fluorescent human RAD51 reveals 
multiple nucleation sites and filament seg- 
ments tightly associated along a single DNA 
molecule. Structure 15(5):599-609 

Unger VM (2001) Electron cryomicroscopy 
methods. Curr Opin Struct Biol 11(5): 
548-554 

Tanabe Y, Fujita E, Momoi T (2011) FOXP2 
promotes the nuclear translocation of POT], 
but FOXP2 (R553H), mutation related to 
speech-language disorder, partially prevents 
it. Biochem Biophys Res Commun 410(3): 
593-596 

Denhardt DT (1992) Mechanism of action of 
antisense RNA. Sometime inhibition of tran- 
scription, processing, transport, or transla- 
tion. Ann N Y Acad Sci 660:70-76 

Chiu Y-L, Rana TM (2002) RNAi in human 
cells: basic structural and functional features 
of small interfering RNA. Mol Cell 10(3): 
549-561 

Karimova G, Pidoux J, Ullmann A et al 
(1998) A bacterial two-hybrid system based 
on a reconstituted signal transduction path- 
way. Proc Natl Acad Sci 95(10):5752-5756 
Rossi F, Charlton CA, Blau HM (1997) Mon- 
itoring protein-protein interactions in intact 
eukaryotic cells by B-galactosidase 


198. 


199. 


200. 


201. 


202. 


203. 


204. 


205. 


206. 


207. 


208. 


209. 


210. 


complementation. Proc Natl Acad Sci 
94(16):8405-8410 

Galarneau A, Primeau M, Trudeau L-E et al 
(2002) B-Lactamase protein fragment com- 
plementation assays as in vivo and in vitro 
sensors of protein-protein interactions. Nat 


Biotechnol 20(6):619-622 

Hu C-D, Chinenov Y, Kerppola TK (2002) 
Visualization of interactions among bZIP and 
Rel family proteins in living cells using bimo- 
lecular fluorescence complementation. Mol 
Cell 9(4):789-798 

Lemmens I, Eyckerman S, Zabeau L et al 
(2003) Heteromeric MAPPIT: a novel strat- 
egy to study modification-dependent pro- 
tein—-protein interactions in mammalian cells. 
Nucleic Acids Res 31(14):e75-e 

Stefan E, Aquin S, Berger N et al (2007) 
Quantification of dynamic protein complexes 
using Renilla luciferase fragment complemen- 
tation applied to protein kinase A activities 
in vivo. Proc Natl Acad Sci 104(43): 
16916-16921 

Hubsman M, Yudkovsky G, Aronheim A 
(2001) A novel approach for the identifica- 
tion of protein-protein interaction with inte- 
gral membrane proteins. Nucleic Acids Res 
29(4):e18-e 

Kato N, Jones J (2010) The split luciferase 
complementation assay. Plant Develop Biol 
Springer 655:359-376 

Russ WP, Engelman DM (1999) TOXCAT: a 
measure of transmembrane helix association 
in a biological membrane. Proc Natl Acad Sci 
96(3):863-868 

Kozakov D, Hall DR, Xia B et al (2017) The 
ClusPro web server for protein-protein dock- 
ing. Nat Protoc 12(2):255 

Tovchigrechko A, Vakser IA (2006) 
GRAMM.-X public web server for protein—- 
protein docking. Nucleic Acids Res 34 
(suppl_2):W310-W3W4 

Macindoe G, Mavridis L, Venkatraman V et al 
(2010) HexServer: an FFT-based protein 
docking server powered by graphics proces- 
sors. Nucleic Acids Res 38(suppl_2): 
W445-W4W9 

Venkatraman V, Yang YD, Sael L et al (2009) 
Protein-protein docking using region-based 
3D Zernike descriptors. BMC Bioinform 
10(1):1-21 

Esquivel-Rodriguez J, Yang YD, Kihara D 
(2012) Multi-LZerD: multiple protein dock- 
ing for asymmetric complexes. Proteins: 
Struct Funct Bioinform 80(7):1818-1833 
Duhovny D, Nussinov R, Wolfson HJ (2002) 
Efficient unbound docking of rigid mole- 
cules. Springer:185-200 


211. 


212. 


213. 


214. 


215. 


216. 


217. 


218. 


219. 


220. 


221. 


222. 


223. 


224. 


Gray JJ, Moughon S, Wang C et al (2003) 
Protein—-protein docking with simultaneous 
optimization of rigid-body displacement and 
side-chain conformations. J Mol Biol 331(1): 
281-299 

Pierce BG, Hourai Y, Weng Z (2011) Accel- 
erating protein docking in ZDOCK using an 
advanced 3D convolution library. PLoS One 
6(9):e24657 

Breitkreutz B-J, Stark C, Reguly T et al 
(2007) The BioGRID interaction database: 
2008 update. Nucleic Acids Res 36 
(suppl_1):D637-DD40 

Xenarios I, Rice DW, Salwinski L et al 
(2000) DIP: the database of interacting pro- 
teins. Nucleic Acids Res 28(1):289-291 

Das J, Yu H (2012) HINT: high-quality pro- 
tein interactomes and their applications in 
understanding human disease. BMC Syst 
Biol 6(1):1-12 

Peri S, Navarro JD, Amanchy R et al (2003) 
Development of human protein reference 
database as an initial platform for approaching 
systems biology in humans. Genome Res 
13(10):2363-2371 

Keshava Prasad TT, Goel R, Kandasamy K 
et al (2009) Human protein reference data- 
base—2009 update. Nucleic Acids Res 37 
(suppl_1):D767—DD72 

Szklarczyk D, Franceschini A, Wyder S et al 
(2015) STRING v10: protein-protein inter- 
action networks, integrated over the tree of 
life. Nucleic Acids Res 43(D1):D447-DD52 
Calderone A, Castagnoli L, Cesareni G 
(2013) Mentha: a resource for browsing 
integrated protein-interaction networks. Nat 
Methods 10(8):690-691 

Schaefer MH, Fontaine J-F, Vinayagam A 
et al (2012) HIPPIE: integrating protein 
interaction networks with experiment based 
quality scores. PLoS One 7(2):e31826 
Licata L, Briganti L, Peluso D et al (2012) 
MINT, the molecular interaction database: 
2012 update. Nucleic Acids Res 40(D1): 
D857-DD61 

Wang S, Peng J, Ma J et al (2016) Protein 
secondary structure prediction using deep 
convolutional neural fields. Sci Rep 6(1): 
18962 

Heffernan R, Paliwal K, Lyons J et al (2015) 
Improving prediction of secondary structure, 
local backbone angles and solvent accessible 
surface area of proteins by iterative deep 
learning. Sci Rep 5(1):11476 

Lyons J, Dehzangi A, Heffernan R et al 
(2014) Predicting backbone Ca angles and 
dihedrals from protein sequences by stacked 


225. 


226. 


227. 


228. 


229. 


230. 


231. 


232. 


233. 


234. 


235. 


236. 


237. 


Prediction of Protein Interactions 323 


sparse auto-encoder deep neural network. J 
Comput Chem 35(28):2040-2046 

Min S, Lee B, Yoon S (2017) Deep learning in 
bioinformatics. Brief Bioinform  18(5): 
851-869 

Leung MK, Xiong HY, Lee LJ et al (2014) 
Deep learning of the tissue-regulated splicing 
code. (1367-4811 (Electronic)) 

Taehoon L, Sungroh Y Boosted categorical 
restricted Boltzmann machine for computa- 
tional prediction of splice junctions. PMLR, 
pp 2483-2492 

Zhang S, Liu CC, Li W, Shen H, Laird PW, 
Zhou XJ (2012) Discovery of multi-dimen- 
sional modules by integrative analysis of can- 
cer genomic data. Nucleic Acids Res 40 
(19):9379-9391. https: //doi.org/10.1093/ 
nar/gks725. Epub 2012 Aug 8. PMID: 
22879375; PMCID: PMC3479191 


Taji B, Chan A, ~ Shirmohammadi 
S. Classifying measured electrocardiogram 
signal quality using deep belief 


networks. 2017 


Asgari E, Mofrad MR (2015) Continuous 
distributed representation of biological 
sequences for deep proteomics and genomics. 
(1932-6203 (Electronic)) 


Fakoor R, Ladhak F, Nazi A, et al. Using deep 
learning to enhance cancer diagnosis and 
classification. 2013 


Baldi P, Brunak S, Frasconi P, Soda G et al 
(1999) Exploiting the past and the future in 
protein secondary structure prediction. 
(1367-4803 (Print)) 


Soleymani M, Asghari-Esfeden S, Pantic M, 
et al (2014) Continuous emotion detection 
using EEG signals and facial expressions. 
2014 IEEE International Conference on 
Multimedia and Expo (ICME). 1-6 


Lee B, Baek J, Park S et al (2016) deepTarget: 
end-to-end learning framework for micro- 
RNA target prediction using deep recurrent 
neural networks. ACM 434-442. https: //doi. 
org/10.1145/2975167.2975212 

Petrosian A, Prokhorov D, Homan R et al 
(2000) Recurrent neural network based pre- 
diction of epileptic seizures in intra- and 
extracranial EEG. Neurocomputing 30: 
201-218 

Hochreiter S, Heusel M, Obermayer 
K (2007) Fast model-based protein homol- 
ogy detection without alignment. 
(1367-4811 (Electronic)) 

Wang S, Sun S, Li Z et al (2017) Accurate De 
novo prediction of protein contact map by 
ultra-deep learning model. PLoS Comput 
Biol 13(1):e1005324-e 




Machine Learning Methods for Survival Analysis 
with Clinical and Transcriptomics Data of Breast Cancer 


Le Minh Thao Doan, Claudio Angione, and Annalisa Occhipinti 


Abstract 


Breast cancer is one of the most common cancers in women worldwide and causes an enormous number
of deaths annually. However, early diagnosis of breast cancer can improve survival outcomes, enabling
simpler and more cost-effective treatments. The recent increase in data availability provides unprecedented
opportunities to apply data-driven and machine learning methods to identify early-detection prognostic
factors capable of predicting the expected survival and potential sensitivity to treatment of patients, with the
final aim of enhancing clinical outcomes. This tutorial presents a protocol for applying machine learning
models in survival analysis for both clinical and transcriptomic data. We show that integrating clinical and
mRNA expression data is essential to explain the multiple biological processes driving cancer progression.
Our results reveal that machine-learning-based models such as random survival forests, the gradient boosted
survival model, and the survival support vector machine can outperform traditional statistical methods, i.e.,
the Cox proportional hazards model. The highest C-index among the machine learning models was recorded
when using the survival support vector machine, with a value of 0.688, whereas the C-index recorded using
the Cox model was 0.677. Shapley Additive Explanation (SHAP) values were also applied to identify the
feature importance of the models and their impact on the prediction outcomes.


Key words Breast cancer, Machine learning, Survival analysis, Data integration, Interpretability 


1 Introduction


Breast cancer is a leading cause of cancer-related deaths worldwide
[1]. According to the latest report published by Cancer Research
UK, breast cancer accounts for 15% of annual new cancer cases and
7% of all cancer mortality in the UK [2]. With advancements in
medical treatment and research, the overall survival rate has nearly
doubled in the last 40 years, and around 78% of patients survive
more than 10 years [2]. The survival rate after 5 years for early
diagnosed patients varies between 90% and 99%, while this rate
drops sharply to only 28% for late diagnosed patients [3]. Therefore,
early detection and treatment are crucial for breast cancer patients,
as malignant cells tend to metastasize in later phases [4].


Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology, 
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_16, 
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023 




Clinical data have often been used to develop clinical prediction
models and gain disease insights [5]. More recently, with the
advancement of high-throughput sequencing technology, extensive
omics data and methods for their integration have been produced,
including genomics, transcriptomics, proteomics, and metabolomics
data [6-8]. The study of multi-omics data makes it possible to
investigate the relationships, roles, and actions of the various types
of molecules constituting the cells of an organism and to gain a
comprehensive understanding of the biological system under
examination. Information from omics data can be used to identify
diagnostic and prognostic markers and support the development of
personalized treatments [9]. Many studies have used omics data
to develop accurate prognostic models for different cancer types
[10-12], achieving more precise predictions than conventional
clinical methods.

Following the breakthrough in exploring omics data, multiple
assays from the same set of instances have recently been consolidated
to generate multi-omics data. Their availability has transformed
the biological and medical fields by opening avenues for system-level
integration strategies. Multi-omics integration has been used with
great success to understand the progression mechanisms of cancer
and other diseases, with the eventual goal of obtaining
patient-specific clinical treatments and prevention strategies
[13-15]. In fact, developing algorithms able to process multi-omics
data could provide a sharper view of biomolecules from different
layers, pave the way toward large-scale cell optimization, and
facilitate the understanding of complex biological processes involved
in cancer progression [16, 17]. Hence, compared to single-omics data,
models using multi-omics data and mechanism-based approaches can
provide a deeper understanding of cancer progression and related
mechanisms, including discovering novel biomarkers, studying the
interaction with viruses, and detecting cancer subtypes [18-21].

Most survival-based molecular models have mainly used a single
type of omics data [22]. However, recent investigations found
that a proper combination of clinical and omics survival data could
significantly improve clinical outcomes [5]. This integration usually
outperforms models that rely only on clinical or omics data
[23]. Hence, it is necessary to investigate how different types of
data, such as clinical data, omics data, and their integration, affect
the performance of predictive models.

Recently, machine learning (ML) models have been successfully
developed to process biomedical data, including characterization of
cell phenotype, detection of cancer, and prediction of survival
outcomes [24-27]. Specifically, ML has been widely applied in clinical
diagnosis and medical image analysis to develop computer-aided
diagnosis systems [28]. The volume and variety of clinical and
genomic data collected from patients are increasing significantly,
revealing novel opportunities to apply ML and generate more
insights into the molecular investigation of tumors and cancer
prognosis. ML methods have facilitated the development of a
more precise picture of tumor heterogeneity and contributed
to precision oncology. This allows individual patients to receive a
treatment plan based on a personalized diagnostic and
prognostic risk profile. However, precise diagnosis and treatment of
breast cancer still represent one of the main challenges in
healthcare [29]. Hence, developing accurate prognostic methods is
necessary to significantly improve risk stratification after diagnosis
and increase survival expectation. To achieve this, several
patient-specific techniques have been proposed, relying on
clinical records, biological markers, or their combination
[30, 31]. However, there is still a need to identify the key
biomarkers affecting cancer progression and survival outcomes in
order to develop more accurate personalized treatments.

Survival analysis is a reliable and widely applied statistical
technique among prognostic modeling methods, which attempts to
evaluate the probability that an event occurs within a specific time
[32]. The prediction outcomes of this type of analysis, such as
cancer death or recurrence, are fundamental to numerous clinical
judgments in oncology and play an essential role for patients,
doctors, and scientists [33]. Among the currently available survival
analysis models, the Cox proportional hazards (CPH) regression
model is the most widely applied approach to investigate the effect
of the input features on the survival time of the patients
[34, 35]. So far, numerous prognostic models have been proposed
that apply the CPH regression model to clinical and transcriptomic
data [36] and multi-omics data [37]. However, ML has recently
shown successful applications in the medical and healthcare
fields. Many ML models have been employed in cancer survival
analysis because of their ability to handle high-dimensional data,
non-linear relationships, and interaction effects [38, 39]. ML-based
approaches for survival analysis, such as random survival forests
[40], gradient boosted survival model [41], survival support vector
machine [42], Cox-nnet [43], and SALMON [44], have shown
the feasibility of accurately predicting cancer outcomes using
clinical and omics data.

Although survival analysis is widely applied in clinical studies,
its prediction in practice still relies heavily on the subjective
interpretation of the clinician, limiting reproducibility and accuracy
[45]. Therefore, this tutorial investigates breast cancer survival
analysis by proposing a framework based on ML algorithms to
perform survival analysis using clinical and transcriptomic data.
The CPH model and three ML-based models, namely random survival
forests (RSF), gradient boosted survival model (GBS), and survival
support vector machine (SSVM), are implemented and tested on
the METABRIC dataset [46]. Our objectives are to classify the
patients into risk groups (i.e., high risk and low risk) and unveil
the prognostic predictors impacting the survival outcomes of
patients. Consequently, identifying the patient risk groups could
assist doctors in determining the course of treatment, promoting
effective therapies, and supporting personalized clinical
decision-making and recommendation.

Hence, the aim of our work is twofold: (1) to present a protocol
for applying ML algorithms in survival analysis, describing the
elements of the study design, experiment process, and performance
evaluation criteria so that the protocol can be generalized and
adapted to other publicly available clinical and transcriptomic
data, and (2) to uncover critical prognostic factors affecting
the survival likelihood of breast cancer patients by employing the
most recent statistical techniques for the interpretability of ML
models (i.e., SHAP values) [47].

2 Backgrounds

This section presents the main methodologies and ML algorithms
applied in our tutorial. The main differences between the three ML
algorithms applied for survival analysis are also discussed.

2.1 Survival Analysis


Survival analysis is a statistical procedure for analyzing the
expected duration of time until the occurrence of an event of
interest (e.g., death or disease recurrence). One of the main
challenges associated with survival analysis is dealing with
censored data, a form of missing information that occurs because
of limited observation time, withdrawal from observation, or loss to
follow-up during the study period [48]. Censored data can be
classified into two groups: left-censored and right-censored data.
The former occurs when the event has already occurred before the
beginning of the study, while the latter occurs when the survival
time is only known to exceed a certain value, but the exact time is
unknown. Right-censored data is the most common type of
censored data [48]; therefore, this chapter will focus on survival
analysis for right-censored data.
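
The right-censoring scheme described above can be made concrete with a
small sketch. It assumes hypothetical follow-up records (the `records`
list, its field layout, and the `STUDY_END` cut-off are illustrative
and are not taken from any dataset used in this chapter) and derives,
for each patient, an observed time together with an event flag that is
1 if the death was observed and 0 if the observation was right-censored.

```python
# Hypothetical follow-up records: (patient id, months until death or None,
# months until withdrawal from the study or None). Illustrative values only.
STUDY_END = 60.0  # assumed length of the observation window, in months

records = [
    ("p1", 12.0, None),   # death observed at month 12
    ("p2", None, 30.0),   # withdrew at month 30, outcome unknown afterwards
    ("p3", None, None),   # still alive when the study ended
    ("p4", 75.0, None),   # died only after the observation window closed
]


def encode_right_censored(death_time, dropout_time, study_end=STUDY_END):
    """Return (observed_time, event) for one patient.

    event = 1 if the death falls inside the period under observation,
    event = 0 if the patient is right-censored at the last time seen.
    """
    # Last moment at which the patient was still under observation.
    last_seen = study_end if dropout_time is None else min(dropout_time, study_end)
    if death_time is not None and death_time <= last_seen:
        return death_time, 1
    return last_seen, 0


encoded = {pid: encode_right_censored(death, dropout)
           for pid, death, dropout in records}

for pid, (time, event) in encoded.items():
    print(pid, time, "event observed" if event else "right-censored")
```

For a censored patient the stored time is only a lower bound on the true
survival time, which is why such records cannot simply be discarded or
treated as events.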

For a given instance $i$, the survival information associated with
$i$ comprises two elements: a binary event indicator $E_i$, in which
$E_i = 0$ for a censored instance and $E_i = 1$ if the event (e.g., death) is
observed, and a failure event time $T_i$, a non-negative random
variable representing the duration between the beginning of the
study and the occurrence of the event. The formula below reports
the probability of observing the event by time $t$.

$$F(t) = \Pr[T \le t]. \tag{1}$$

The function $F(t)$ is defined as the cumulative distribution
function.


Let the probability density function be denoted as $f(t)$. The
survival function, $S(t)$, provides the probability that the event is
observed after time $t$, and it is defined as

$$S(t) = \Pr[T > t] = 1 - F(t) = \int_t^{\infty} f(x)\,dx. \tag{2}$$

The hazard function $h(t)$ represents the instantaneous rate at which
the event happens within the interval $[t, t + dt)$, given that it did
not occur before time $t$. Thus, a lower hazard corresponds to a
greater chance of survival. The hazard function $h(t)$ is defined as

$$h(t) = \lim_{dt \to 0} \frac{\Pr[t \le T < t + dt \mid T \ge t]}{dt}. \tag{3}$$

By using the definition of $S(t)$ in Eq. 2, the hazard function can
also be written as

$$h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt} \log S(t). \tag{4}$$

Survival and hazard functions are two fundamental concepts in
survival analysis, and they are connected by the expression below

$$S(t) = \exp\left(-\int_0^t h(x)\,dx\right). \tag{5}$$

Equation 5 can be derived by integrating the first and last terms
in Eq. 4 from 0 to $t$ and using $S(0) = 1$. The integral inside the
parentheses in Eq. 5 describes the sum of the risks of observing the
event between time 0 and time $t$. This quantity is called the
cumulative hazard, and it is defined as $H(t) = \int_0^t h(x)\,dx$.

2.2 Cox Proportional Hazards Model

The Cox proportional hazards (CPH) model [49] has been the
most commonly applied method in clinical studies to investigate
the relationships between time-to-event or survival-time outcomes
and explanatory variables. The CPH model is a regression approach
used to calculate the hazard ratio (HR) and its confidence interval
between patients belonging to different risk groups. Specifically,
the HR can be interpreted as a relative risk. The CPH model is a
semi-parametric model, and it is specified through the hazard
function $h(t)$, representing the hazard at time $t$, defined as

$$h(t) = h_0(t)\, e^{\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}, \tag{6}$$

where $h_0(t)$ is the baseline hazard function, and $\beta_1, \beta_2, \ldots, \beta_p$ are the
corresponding regression coefficients of covariates $x_1, x_2, \ldots, x_p$.
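
As a hedged illustration of Eq. 6, the sketch below evaluates the hazard
for two hypothetical patients. The coefficients in `BETA`, the covariate
vectors, and the constant baseline hazard `H0_RATE` are assumptions
chosen only for illustration; the CPH model itself leaves the baseline
hazard unspecified. The sketch also applies Eq. 5 to turn the hazard
into a survival probability, approximating the cumulative hazard with a
midpoint Riemann sum.

```python
import math

# Illustrative values only (not estimated from any dataset).
BETA = [0.7, -0.4]   # assumed regression coefficients beta_1, beta_2
H0_RATE = 0.01       # assumed constant baseline hazard h_0(t), per month


def baseline_hazard(t):
    """Assumed baseline hazard h_0(t); constant in t for simplicity."""
    return H0_RATE


def hazard(t, x):
    """Eq. 6: h(t) = h_0(t) * exp(beta_1 * x_1 + ... + beta_p * x_p)."""
    linear_predictor = sum(b * xi for b, xi in zip(BETA, x))
    return baseline_hazard(t) * math.exp(linear_predictor)


def survival(t, x, steps=1000):
    """Eq. 5: S(t) = exp(-H(t)), with H(t) from a midpoint Riemann sum."""
    dt = t / steps
    cumulative_hazard = sum(hazard((j + 0.5) * dt, x) * dt for j in range(steps))
    return math.exp(-cumulative_hazard)


patient_a = [1.0, 0.0]   # differs from patient_b by one unit of covariate 1
patient_b = [0.0, 0.0]   # reference patient (all covariates zero)

# Ratio of the two hazards at several time points (in months).
ratios = [hazard(t, patient_a) / hazard(t, patient_b) for t in (1.0, 12.0, 60.0)]

print("hazard ratios over time:", ratios)
print("exp(beta_1):", math.exp(BETA[0]))
print("S(60) for patient_a:", survival(60.0, patient_a))
print("S(60) for patient_b:", survival(60.0, patient_b))
```

Because the baseline hazard cancels in the ratio, the printed hazard
ratios coincide with exp(beta_1) at every time point, and the patient
with the larger hazard has the lower survival probability.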

A value of $e^{\beta_i}$ above 1, or $\beta_i$ above zero, shows that an increase
in the value of the $i$th covariate leads to a rise in the event hazard and,
consequently, a reduction in survival time. In other words, the
covariate is positively correlated with the event likelihood and
negatively associated with survival time. In contrast, a value of
$e^{\beta_i}$ below 1, or $\beta_i$ below zero, shows that an increase in the value
of the $i$th covariate leads to a decreased probability of observing the
event. If $e^{\beta_i}$ is equal to 1, the covariate does not affect the survival
probability. Overall, observing $e^{\beta_i}$ above 1 is a bad prognostic
indicator in cancer studies, whereas observing $e^{\beta_i}$ below 1 is a good
indicator.

Let us consider two observations $\mu$, $\nu$ with covariates $x_{\mu i}$ and $x_{\nu i}$,
$i = 1, \ldots, k$, and hazard functions defined as

$$h_\mu(t) = h_0(t)\, e^{\beta_1 x_{\mu 1} + \beta_2 x_{\mu 2} + \cdots + \beta_k x_{\mu k}} \tag{7}$$

$$h_\nu(t) = h_0(t)\, e^{\beta_1 x_{\nu 1} + \beta_2 x_{\nu 2} + \cdots + \beta_k x_{\nu k}} \tag{8}$$

Using the definitions of $h_\mu(t)$ and $h_\nu(t)$ in Eqs. 7 and 8, the HR
for the two observations $\mu$, $\nu$ is calculated as

$$\mathrm{HR} = \frac{h_\mu(t)}{h_\nu(t)} = \frac{h_0(t)\, e^{\sum_{i=1}^{k} \beta_i x_{\mu i}}}{h_0(t)\, e^{\sum_{i=1}^{k} \beta_i x_{\nu i}}} = e^{\sum_{i=1}^{k} \beta_i (x_{\mu i} - x_{\nu i})}. \tag{9}$$

Since the HR is not a function of time $t$, the ratio of the hazards of
the two groups must remain constant throughout the study, and
their hazard curves should not cross. In fact, the CPH model is
based on two assumptions: (1) the survival curves for two or more
strata must have proportional hazard functions over time $t$ and
(2) each covariate makes a linear contribution to the model.


2.3 Machine Learning Models


ML models employed for survival analysis have recently received
increasing interest due to their promising applications in cancer
research [39]. They are mainly applied to predict survival outcomes
and the corresponding survival likelihood following statistical
survival analysis approaches. However, rather than focusing on
estimating survival curves, ML approaches mainly focus on predicting
the time of event occurrence by merging traditional statistical
survival analysis techniques with the most recent statistical models.
The advantage of using ML algorithms for survival analysis is that
they can provide more accurate solutions while dealing with the
statistical challenges associated with high-dimensional data.

In this tutorial, patient-specific survival risk probabilities are 
predicted using the most recent ML algorithms, including random 
survival forests, gradient boosting model, and survival support 
vector machine, which have recently become popular due to their 
effectiveness in handling survival data [39]. 


2.3.1 Random Survival Forests




Random survival forests (RSF) is a random forest-based learning 
method used to analyze right-censored survival data [50]. The 
model uses an ensemble approach to generate predictions by inte- 
grating the estimations of multiple trees. This allows the model to 
gain more precise predictions than using a single tree. The algo- 
rithm employs tree-structured and bagging algorithms, typical of 
the random forest model [51], based on the three steps below: 


1. A random bootstrap sample from the training set is selected to
grow a tree.


2. The tree nodes are divided by a random attribute selection
rather than using all the features available in the dataset.


3. The prediction of the random forest algorithm is determined
by averaging the predictions of the individual trees.
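The bootstrap step in the list above can be illustrated with a short sketch using only the Python standard library. The sample size and random seed are arbitrary choices made for this example. It shows that a bootstrap sample of size n leaves roughly 37% of the observations out-of-bag, since each observation is missed with probability (1 - 1/n)^n, which approaches exp(-1).

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n = 1000
indices = list(range(n))

# Draw a bootstrap sample: sampling with replacement, same size as the data
bootstrap = [random.choice(indices) for _ in range(n)]

# Observations that were never drawn are "out-of-bag" for this tree
out_of_bag = set(indices) - set(bootstrap)
oob_fraction = len(out_of_bag) / n

# Probability that a given observation is never drawn in n draws
expected_fraction = (1 - 1 / n) ** n

print(f"empirical OOB fraction: {oob_fraction:.3f}")
print(f"expected OOB fraction:  {expected_fraction:.3f}")
print(f"limit exp(-1):          {math.exp(-1):.3f}")
```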


Consequently, each tree in the forest is grown on an independent
bootstrap sample extracted from the training data. Because the
trees are grown independently and on random feature subsets, the
correlation between them is lowered, which reduces the variance
that occurs when using a single decision tree and yields better
predictive performance. This technique aggregates the decisions of
different trees, and it often offers better generalization. The
random forest algorithm has been demonstrated to be a widely
adopted and effective ML technique for high-dimensional data, and
it is regarded as one of the most successful ensemble methods [52].

RSF extends the above approach by integrating censored infor- 
mation from survival data into the splitting rules applied for the 
growth of the forest. RSF is one of the most powerful and widely 
used learning algorithms for survival analysis. Each survival tree 
splitting employs the log-rank splitting rule to develop a set of 
survival trees, maximizing the log-rank test statistic. Other splitting 
rules, such as log-rank score or conservation of events, can be used 
during the growing phase of the forest. However, log-rank splitting
is the most popular technique, and it is the one used in this tutorial.
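To make the log-rank splitting rule more concrete, the following is a minimal sketch, in plain Python, of how a candidate split could be scored with the two-sample log-rank statistic. The toy times, event indicators, covariate values, and candidate thresholds are invented for illustration, and the function is a simplified stand-in rather than the implementation used by scikit-survival.

```python
import math

def logrank_statistic(times, events, in_left):
    """Two-sample log-rank Z statistic for a candidate split.

    times: observed times; events: 1 = event, 0 = censored;
    in_left: True if the sample falls in the left child node.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        at_risk = [i for i, ti in enumerate(times) if ti >= t]
        n = len(at_risk)
        n_left = sum(1 for i in at_risk if in_left[i])
        dying = [i for i in at_risk if times[i] == t and events[i] == 1]
        d = len(dying)
        d_left = sum(1 for i in dying if in_left[i])
        # Observed events in the left node minus those expected under no difference
        observed_minus_expected += d_left - d * n_left / n
        if n > 1:
            variance += d * (n_left / n) * (1 - n_left / n) * (n - d) / (n - 1)
    return observed_minus_expected / math.sqrt(variance) if variance > 0 else 0.0

# Toy data (hypothetical): high covariate values go with short survival times
times = [2, 3, 4, 5, 8, 9, 10, 12]
events = [1, 1, 1, 0, 1, 0, 1, 1]
x = [5, 6, 7, 6, 1, 2, 1, 2]

# Score each candidate threshold and keep the one maximizing |Z|
thresholds = [1.5, 3.5, 5.5, 6.5]
scores = {c: abs(logrank_statistic(times, events, [xi <= c for xi in x]))
          for c in thresholds}
best_threshold = max(scores, key=scores.get)
print(best_threshold, scores)
```

The threshold with the largest absolute statistic is the one that best separates the survival experience of the two child nodes.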

According to Ishwaran et al. [50], the description of the RSF 
algorithm can be summarized as below: 


1. The number of trees $n$ in the forest and the number of
predictors $k$ for the splitting of each node are defined.


2. $n$ bootstrap samples are drawn from the data. Each sample
excludes the out-of-bag data, which can be shown to be
approximately equal to 37% of the full dataset [53].


3. A survival tree is grown on each bootstrap sample using the
following approach:


• $k$ candidate predictor variables are randomly chosen.


• For each possible splitting point of each of the $k$ candidates,
the log-rank statistic is computed.


• The node is split based on the log-rank splitting rule that
maximizes the survival difference between child nodes.


• The tree continues to grow to full size under the constraint
that the number of event observations (e.g., deaths) in each
node is greater than a predefined minimum terminal
node size.


4. A cumulative hazard function is computed for each tree. Then,
the results are averaged to estimate the ensemble cumulative
hazard function for all trees.


5. Harrell's concordance index [54] is calculated on the
out-of-bag data and used to determine the predictive accuracy of
the model.


2.3.2 Gradient Boosted Survival


Gradient boosted survival analysis (GBS) is a gradient boosting 
machine learning model applied to analyze censored data 
[55]. The predictive algorithm is based on an additive regression 
model of sequentially fitted weak learners (base learners) while 
minimizing the loss function. It is thus regarded as an ensemble 
learning method. It is a nonparametric approach and does not 
require any functional form assumption, providing researchers 
with more flexibility than other survival models. GBS also generates
more robust predictions than a single learner, as it consolidates
the estimations of several weak learners.

In GBS, each successive tree is an enhancement over the previous
one. In other words, the second tree improves on the first tree
by learning from the residual of its prediction, while the third tree
improves on the first and second ones, and so on. The outcome is
estimated by the weighted sum of all the predicted values given by
the individual trees.
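The residual-fitting idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: it uses one-split regression stumps, a squared-error loss (whose negative gradient is simply the residual), and a fixed learning rate, whereas the tutorial uses a Cox-based loss. The data and the names `fit_stump` and `boost` are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def boost(x, y, n_trees=20, learning_rate=0.5):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    init = sum(y) / len(y)  # start from the mean of the targets
    trees = []
    predictions = [init] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        predictions = [pi + learning_rate * tree(xi)
                       for pi, xi in zip(predictions, x)]
    # The final model is the initial guess plus the weighted sum of all trees
    return lambda xi: init + learning_rate * sum(tree(xi) for tree in trees)

def mse(model, x, y):
    """Mean squared error of a model on the data."""
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)

# Toy regression data (hypothetical values)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]

baseline_error = mse(lambda xi: sum(y) / len(y), x, y)
boosted_error = mse(boost(x, y), x, y)
print(baseline_error, boosted_error)
```

Each added stump corrects part of what the previous ensemble got wrong, so the training error of the boosted model falls below that of the constant initial guess.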

The gradient boosting algorithm can be summarized as below [56, 57]:


1. The number of iterations $M$, the base learner model $h(x, \theta)$, and
the loss function $L(y, f)$ are defined, where $(x_i, y_i)_{i=1}^{N}$ is the input
data, $\theta$ are the parameter estimates, and $f$ is the unknown
function that maps the input variables $x$ to the target variables $y$.


2. An initial random guess $f_0$ of the unknown function $f$ is
defined.


3. For each iteration $k$, the following steps are performed:


• The negative gradient of the loss function at iteration $k$ is
calculated.


• The new base learner function $h(x, \theta_k)$ is fitted.


• The best gradient descent step size $\rho_k$ is estimated as follows:

$$\rho_k = \arg\min_{\rho} \sum_{i=1}^{N} L\left[y_i, f_{k-1}(x_i) + \rho\, h(x_i, \theta_k)\right]. \tag{10}$$


• The estimated function $f_k$ is updated as
$f_k = f_{k-1} + \rho_k h(x, \theta_k)$.


4. The output of the final model is defined as $f(x) = \sum_{k=1}^{M} f_k(x)$.


In this tutorial, we use regression trees as the base learner
model and CPH as the loss function in the GBS [58]. By doing
this, the hazard, survival function, and log-hazard ratio are
estimated by summing up the predictions of the regression trees.


2.3.3 Survival Support Vector Machine


Support vector machine (SVM) is a very popular supervised
learning method for regression and classification problems. SVM
has also been applied to censored data for survival analysis
[59]. The central idea of SVM is to classify data points by
maximizing the margin between groups in a high-dimensional space and
finding a separating hyperplane that minimizes misclassification
[60]. The hyperplane separates the classes and is as far from the
closest observations as possible. The support vectors are then
defined as the data points nearest to the maximum margin hyperplane.

Survival support vector machine (SSVM) follows the same
approach as SVM, but it employs an asymmetric penalty function
to handle survival data. Specifically, linear SVM can be adapted to
survival analysis through ranking, regression, and hybrid
approaches. In the ranking approach, the learning model assigns a
lower rank to instances with a shorter time to event by examining
all possible pairs of instances in the training data, whereas in the
regression approach it predicts the exact survival times.
Because of its efficiency and performance, this work focuses
on linear SSVM to handle survival analysis problems. We apply a
more efficient SVM algorithm called FastSVM [61]. This model has
lower computational training costs since it is based on truncated
Newton optimization and order statistic trees.


2.4 Feature Selection


When working with transcriptomic data, the number of features
often significantly exceeds the number of observations, leading ML
algorithms to overfit the data and report poor performance. For
this reason, several feature selection techniques have been
proposed, with the aim of identifying and selecting an optimal subset
of features. The most widely applied feature selection models
include Pearson correlation, Spearman correlation [62], principal
component analysis (PCA) [63], and genetic algorithm
(GA) [64]. However, Schemper et al. [65] have shown that the
Pearson and Spearman correlation models are unsuitable for
censored data. Besides, dimensionality reduction methods,
such as PCA, are difficult to interpret and are better suited to
linear or approximately linear high-dimensional data [66],
while GA-based wrapper techniques have low computational
efficiency [67]. Maximum relevance and minimum redundancy
(mRMR) [68], a technique that selects features based on
their correlation with the response variable, has the advantage of
faster computation and stronger robustness than the above feature
selection techniques. Hence, mRMR is applied in this tutorial.
According to Peng et al. [68], the model ranks the features
according to both their relevance to the outcome and their low
correlation with each other. The steps performed by the mRMR
algorithm are described below. First, mRMR identifies the first
feature based on the maximum relevance value.

Let $I$ be the mutual information (MI) used to measure both
relevance and redundancy between features. The MI of two random
features $m$ and $n$ is given by

$$I(m, n) = \iint p(m, n) \log \frac{p(m, n)}{p(m)\, p(n)} \, dm \, dn, \tag{11}$$

where $p(m)$, $p(n)$, and $p(m, n)$ denote the probability density
functions of $m$ and $n$ and their joint probability density function,
respectively.

Next, let $X$ denote the whole feature set, $S$ denote the
selected feature set containing $s$ features, and $c$ be the outcome class.
For an individual feature $x_i$, $I(x_i, c)$ denotes its MI with the class $c$.
The maximum relevance criterion, reflecting the largest
dependence of $x_i$ on the target class $c$, is computed by

$$\max D(S, c), \quad D(S, c) = \frac{1}{|S|} \sum_{x_i \in S} I(x_i, c). \tag{12}$$


The features selected by the maximum relevance criterion are
likely to have large dependency among them. Hence, the minimum
redundancy condition is added and calculated by

$$\min R(S), \quad R(S) = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i, x_j), \tag{13}$$

where $I(x_i, x_j)$ is the MI of features $x_i$ and $x_j$.

The final mRMR feature set is chosen by simultaneously
optimizing Eqs. 12 and 13. An incremental search approach is used to
find the near-optimal features. Let us consider the feature set $S_{s-1}$
with $s-1$ features already identified. The $s$th feature is selected
from the remaining feature set $\{X - S_{s-1}\}$ by optimizing the
following condition:

$$\max_{x_j \in X - S_{s-1}} \left[ I(x_j, c) - \frac{1}{s-1} \sum_{x_i \in S_{s-1}} I(x_j, x_i) \right]. \tag{14}$$
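The incremental search of Eq. 14 can be sketched in plain Python as follows. This is an illustrative simplification, not the implementation of the mrmr package used later in the tutorial: it assumes discrete features, so the MI integral of Eq. 11 becomes a sum over observed value counts, and the feature names and values are hypothetical.

```python
import math
from collections import Counter

def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[va] * count_b[vb]))
        for (va, vb), c in count_ab.items()
    )

def mrmr(features, target, n_select):
    """Greedy mRMR: relevance minus mean redundancy with the selected set (Eq. 14)."""
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    # The first feature is the one with maximum relevance
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < n_select:
        def score(name):
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        remaining = [name for name in features if name not in selected]
        selected.append(max(remaining, key=score))
    return selected

# Toy example (hypothetical): a four-class outcome encoded by two informative bits
target = [0, 0, 1, 1, 2, 2, 3, 3]
features = {
    "gene_a": [0, 0, 0, 0, 1, 1, 1, 1],       # high bit of the class
    "gene_a_copy": [0, 0, 0, 1, 1, 1, 1, 1],  # noisy duplicate of gene_a
    "gene_b": [0, 0, 1, 1, 0, 0, 1, 1],       # low bit of the class
}

selected = mrmr(features, target, n_select=2)
print(selected)
```

In this toy case the noisy duplicate is relevant on its own, but it is penalized for its redundancy with the feature it copies, so the two complementary features are chosen.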


3 Methods


This section describes the dataset used in this tutorial and the
experiments run to perform survival analysis. We conducted three
experiments to investigate the performance of CPH- and
ML-based models on clinical data, transcriptomic data, and the
integration of the two data types. First, we describe the dataset;
then, we discuss the study design, the initial setting, and the
details of the three experiments.


3.1 Dataset


The METABRIC dataset [46] was used to assess the predictive
performance of the CPH model and the ML methods implemented in
this chapter. The dataset has been downloaded from cBioPortal
(www.cbioportal.org/datasets). The tumor information in the
original METABRIC study was collected from five centers in the
UK and Canada. The objective of the study was to analyze the effect
of genomic and transcriptomic profiles on breast cancer survival in
order to identify the optimal treatment approach for patients. The
dataset contains clinical information for 2509 primary breast cancer
samples and 2509 molecular profiles, including 1904 with
transcriptomic data, with a maximum follow-up period of 355 months.
Clinical data was obtained from cohort studies and trials, and it
includes the survival time in months and status (deceased or
censored), while gene expression data was extracted from mRNAseq,
which provides a snapshot of the abundance of the different gene
transcripts in the cell. The detailed description of tissue specimens
and staging can be found in the original METABRIC study of Curtis
et al. [69]. To explore the power of CPH and ML models, we considered
all the clinical and transcriptomic covariates available in the dataset.


3.2 Study Design


We set up three experiments to evaluate the CPH and ML models 
for survival analysis, including (1) clinical data, (2) transcriptomic 
data, and (3) integrating clinical and transcriptomic data. Python 
programming language (version 3.8.8) and its libraries on the 
Anaconda environment (version 4.10.3) were used to conduct the 
experiments. Python 3 can be run on any popular operating system 
such as Windows, Mac, and Linux. However, the steps in this 
chapter are demonstrated on a Windows 10 Pro—64-bit operating 
system. 

We separately implemented all the experiments in Jupyter note- 
books, an open-source web application that integrates code, visua- 
lizations, computational output, and other resources in one single 
file. However, the code can also be efficiently run online on Google 
Colab (https://colab.research.google.com) without any software 
installation. 




3.3 Initial Setting 


The CPH and the current state-of-the-art ML algorithms pre- 
sented in Subheading 2 were applied and evaluated in this chapter 
using scikit-survival packages [70] for Python. 

The procedure of the experiments followed in this tutorial is 
described below and presented in Fig. 1: 


Step 1: The libraries and packages required for the analysis are
first installed and imported.


Step 2: The METABRIC dataset is loaded.


Step 3: Preprocessing steps and data exploration techniques are
performed to investigate the dataset.


Step 4: Feature selection is applied to select the optimal number
of features from a large set of variables (this step is applied
only to the transcriptomic data, where data reduction is
necessary to improve the performance of models).


Step 5: The CPH model is run, and the results are plotted and
interpreted.


Step 6: ML algorithms are set up and run to generate the final
predictive models.


Step 7: Results are interpreted using SHAP values, and models are
compared.


The outputs of the survival ML models are patient-specific risk
scores, which incorporate OS time and the corresponding event
censorship indicator. A higher risk score indicates a greater
likelihood of observing the event of interest (e.g., death) early.
Therefore, it is necessary to find an appropriate metric to evaluate
the performance of models based on such predicted risk scores.

Harrell's concordance index (C-index) [54], a goodness-of-fit
measure for survival models, is used to estimate the concordance
probability $P(\eta_j > \eta_i \mid T_i > T_j)$ for two instances $i$ and $j$, which
measures the rank association between their OS times $T_i$, $T_j$ and
the models' predictions $\eta_i$, $\eta_j$. It estimates the probability that,
for a random pair of observations, the patient with the higher risk
score is the one with the shorter survival time. Hence, it quantifies
how well a model predicts the ordering of patients' death times.
C-index values range from 0 to 1, where a value of 0.5 corresponds
to a random model with no predictive discrimination. In contrast, a
C-index equal to 1 implies a perfect ranking of the observed and
predicted survival times.
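As a minimal sketch of how the C-index handles censoring, the function below counts concordant pairs among the comparable ones in plain Python. It is a simplified illustration (ties in observed time are ignored, and ties in risk score count as one half) rather than the implementation in scikit-survival, and the toy data is hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data (ties in risk count as 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable only if i had the event strictly before
            # the observed time of j; censored i cannot be ordered against j
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy data (hypothetical): the second patient is censored at month 10
times = [5, 10, 15, 20]
events = [1, 0, 1, 1]

perfect = concordance_index(times, events, [0.9, 0.4, 0.6, 0.1])
mixed = concordance_index(times, events, [0.9, 0.4, 0.1, 0.6])
reversed_order = concordance_index(times, events, [0.1, 0.4, 0.6, 0.9])
print(perfect, mixed, reversed_order)
```

Here four pairs are comparable. The censored patient can only appear as the longer-surviving member of a pair, because their true event time is unknown.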


Before starting the analysis, two folders named Data and Plot
must be created on your local machine to store all data and
figures for the experiments. Then, Python 3 [71] needs to be
installed. The software is free and can be downloaded from
www.python.org/downloads/. We recommend using the Anaconda
environment for Python and its libraries to run the experiments
presented in this project.


[Figure 1: workflow diagram titled "Machine Learning Protocol for Survival Analysis", with panels for Step 1 (set up the environment), Step 2 (get data), Step 3 (preprocess and explore data), Step 4 (feature selection), Step 5 (Cox proportional hazards model), Step 6 (machine learning for survival analysis), and Step 7 (interpret results).]

Fig. 1 Tutorial Workflow. After setting the environment (step 1), clinical and transcriptomic data was retrieved
from cBioPortal (step 2). To start the experiment, we loaded the data, followed by data cleaning and data
exploration steps (step 3). Due to the high-dimensional nature of transcriptomic data, a feature selection
step was required before running the machine learning models (step 4). Next, the CPH model was run, and the
results were plotted to investigate the HR and p-value associated with each risk factor (step 5). Then, we
built, trained, and evaluated the ML models for survival analysis. Patients were then divided into high- and
low-risk groups based on the predicted risk scores, and the survival risk differences between groups were
investigated (step 6). Finally, the top critical prognostic markers were identified and interpreted using SHAP
values (step 7)

The data files used for this project and the complete code
notebooks are available at
https://github.com/Angione-Lab/survival_analysis_tutorial.
The repository includes the clinical and transcriptomic data (i.e.,
data_clinical_patient.csv, data_clinical_sample.csv,
data_mRNA_median_all_sample_Zscores.csv), which
are required to run the following steps.

After creating a new notebook, a new cell to run the code
needs to be created. By clicking on the "Run Cell" button,
the code is executed cell by cell. Finally, the Python libraries
and packages need to be installed as shown below:


!pip install dataprep # data exploration
!pip install scikit-survival # survival analysis
!pip install lifelines # plotting survival analysis
!pip install git+https://github.com/smazzanti/mrmr # feature selection
!pip install shap # model interpretation


Other than the above packages, some primary data preproces- 
sing and visualization libraries such as Pandas, NumPy, Matplotlib, 
and Seaborn are expected to be installed if the code is run on a local 
machine. The syntax !pip install + libraries_names can be followed 
to install the preliminary packages. 


!pip install pandas # loading and preprocessing data
!pip install numpy # loading and preprocessing data
!pip install matplotlib # visualisation
!pip install seaborn # visualisation
!pip install -U scikit-learn # preparing ML algorithms


Once the required libraries are installed, they need to be 
imported at the beginning of the notebook to use the relevant 
functions. 




# Packages to load and preprocess data
import numpy as np
import pandas as pd

# Packages to visualise and explore data
import seaborn as sns
sns.set_style("whitegrid")
import matplotlib.pyplot as plt
from dataprep.eda import plot, create_report, plot_missing, plot_correlation

# Feature selection
from mrmr import mrmr_classif

# Packages to prepare data for ML
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

# Packages for survival analysis
from lifelines import CoxPHFitter
from lifelines.utils import k_fold_cross_validation
from lifelines.statistics import logrank_test
from lifelines import KaplanMeierFitter
from lifelines.plotting import add_at_risk_counts

# Packages for ML in survival analysis
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.svm import FastSurvivalSVM
from sksurv.ensemble import RandomSurvivalForest
from sksurv.ensemble import GradientBoostingSurvivalAnalysis
from sksurv.metrics import concordance_index_censored

# Package to interpret data
import shap


3.4 Experiment 1: Clinical Data

Following the workflow presented in Fig. 1, Experiment 1 was
conducted to perform survival analysis on the clinical data. The
data was first loaded into a data frame for data cleaning and
exploratory data analysis (EDA). Then, the CPH model and ML models
were trained and evaluated to predict the survival risk of the
patients. Finally, the results were interpreted to identify the critical
clinical factors associated with low survival of breast cancer
patients. The details of the experiment are presented in the
following sections.


3.4.1 Load Data

The patient information used in our analysis is stored in two files,
one with clinical information and the other with the demographic
characteristics of the patients. These two files need to be merged
into a single data frame for easy processing.


[Figure 2: preview of the merged data frame showing the columns PATIENT_ID, LYMPH_NODES_EXAMINED_POSITIVE, NPI, CELLULARITY, CHEMOTHERAPY, COHORT, ER_IHC, and HER2_SNP6.]

Fig. 2 First five rows of the merged data frame. The data is presented in a table with the clinical features as
columns and patients as rows. As the data frame comprised many columns, only the first eight columns are
displayed in this figure


# Load data
file1 = pd.read_csv('Data/data_clinical_patient.csv')
file2 = pd.read_csv('Data/data_clinical_sample.csv')

# Merge clinical data
data = pd.merge(file1, file2, how="inner", on=["PATIENT_ID"])
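To illustrate what the inner merge does, here is a small sketch on two toy data frames. The patient identifiers and values are hypothetical. The `validate` argument is an optional pandas safeguard added for this example: it raises an error if a patient identifier appears more than once in either table.

```python
import pandas as pd

# Toy stand-ins for the patient and sample files (hypothetical values)
patients = pd.DataFrame({
    "PATIENT_ID": ["P1", "P2", "P3"],
    "AGE_AT_DIAGNOSIS": [61.0, 48.5, 70.2],
})
samples = pd.DataFrame({
    "PATIENT_ID": ["P2", "P3", "P4"],
    "TUMOR_SIZE": [22.0, 10.0, 15.0],
})

# An inner merge keeps only the patients present in both tables
merged = pd.merge(patients, samples, how="inner", on=["PATIENT_ID"],
                  validate="one_to_one")
print(merged)
```

Patients P1 and P4 each appear in only one table, so they are absent from the merged data frame.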


Once the data was loaded and merged, the first five rows of the 
new data frame and its information were extracted to get an over- 
view of the data using the lines below. The outcome is reported in 
Fig. 2. 


# Have a quick look at data
data.head()


Next, an overview of the data frame information can be gener- 
ated by running the lines below. The output is displayed in Fig. 3. 


# Data information
data.info()


The data contained some missing values; hence, it is essential to 
understand the data and preprocess it carefully before implement- 
ing any predictive models. In the next section, different techniques 
to explore and clean the data are performed. 


3.4.2 Preprocess and Explore Data

In order to save computation time, duplicate observations and
unused columns were dropped before conducting exploratory
data analysis. This is one of the fundamental data cleaning steps to
prepare the data for further analysis:


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2509 entries, 0 to 2508
Data columns (total 36 columns):
 #   Column                         Non-Null Count
 0   PATIENT_ID                     2509 non-null
 1   LYMPH_NODES_EXAMINED_POSITIVE  2243 non-null
 2   NPI                            2287 non-null
 3   CELLULARITY                    1917 non-null
 4   CHEMOTHERAPY                   1980 non-null
 5   COHORT                         2498 non-null
 6   ER_IHC                         2426 non-null
 7   HER2_SNP6                      1980 non-null
 8   HORMONE_THERAPY                1980 non-null
 9   INFERRED_MENOPAUSAL_STATE      1980 non-null
 10  SEX                            2509 non-null
 11  INTCLUST                       1980 non-null
 12  AGE_AT_DIAGNOSIS               2498 non-null
 13  OS_MONTHS                      1981 non-null
 14  OS_STATUS                      1981 non-null
 15  CLAUDIN_SUBTYPE                1980 non-null
 16  THREEGENE                      1764 non-null
 17  VITAL_STATUS                   1980 non-null
 18  LATERALITY                     1870 non-null
 19  RADIO_THERAPY                  1980 non-null
 20  HISTOLOGICAL_SUBTYPE           2374 non-null
 21  BREAST_SURGERY                 1955 non-null


Fig. 3 Clinical data information. The figure reports an overview of the clinical data frame, including total
entries, data types, the names of the columns, and the number of validated data points. There are 2509 entries
and 36 columns in the clinical data frame. The first 21 columns are shown in this figure, which include two
types of data: (1) float or numeric and (2) object or non-numeric. Some columns contained missing values
such as LYMPH_NODES_EXAMINED_POSITIVE, and NPI. This analysis provides a useful summary of the data
before implementing any preprocessing steps


• VITAL_STATUS and SAMPLE_ID columns were dropped
because they reported the same information as OS_STATUS
and PATIENT_ID, respectively.


• SEX and SAMPLE_TYPE columns had only a single value;
hence, they were not providing any useful information for the
predictive models and they were removed.


• RSF_STATUS and RSF_MONTHS were derived variables and
were not used in our survival analysis.




Group Patients by CANCER_TYPE 
Breast Cancer 2506 
Breast Sarcoma 3 
Name: PATIENT_ID, dtype: int64 


After the preprocessing, the shape of data is: (2506, 29) 


Fig. 4 Output of preprocessing step. Only three breast sarcoma samples were present in the data; therefore, 
we dropped those three samples and left only one single value in the CANCER_TYPE column, i.e., normal 
breast cancer. As a result, since the CANCER_TYPE column reported the same value for all the samples, and it 
did not add any extra information about the samples, the column was removed and not included in the future 
steps of the analysis. Finally, after preprocessing, the final dataset consisted of 2506 samples and 29 features 


# Drop unused columns: based on data.info(), we drop some unused
# and redundant columns
drop_list = ['VITAL_STATUS', 'SAMPLE_ID', 'SEX', 'SAMPLE_TYPE',
             'RSF_STATUS', 'RSF_MONTHS']
data = data.drop(drop_list, axis=1)


We also checked the number of patients for each cancer type 
since the target of our study is breast cancer. The dataset included 
some breast sarcoma instances, a sporadic form of breast cancer. 
However, since a normal breast cancer prognosis is our primary 
objective, the data was filtered by CANCER_TYPE to keep normal 
breast cancer only. The lines below show the implementation of the 
filtering steps. The output of these steps is reported in Fig. 4. 


# We check the number of patients by cancer type
print('\nGroup Patients by', data.groupby('CANCER_TYPE')['PATIENT_ID'].count())

# There are only three patients with Breast Sarcoma,
# so we keep only the patients with the Breast Cancer type
data = data[data['CANCER_TYPE'] == 'Breast Cancer']

# Delete the CANCER_TYPE column, as it now reports the same value for
# all the samples and does not bring any useful information for
# the following steps of the analysis
data = data.drop(['CANCER_TYPE'], axis=1)

print('\nAfter the preprocessing, the shape of data is:', data.shape)


Before continuing the preprocessing phase (step 3 in Fig. 1),
the data was explored to investigate data types, data distribution,
and missing values. The library dataprep was used for exploratory
data analysis (EDA). Other options to explore specific parts of the
report, such as missing values and data distribution, were also
used, as shown in the code below.




# Understand data
# Save the report as an HTML file
create_report(data).save('Plot/EDA_clinical_report')

# Optional: explore parts of the report
plot_missing(data).save('Plot/missing_values.html')
plot(data).save('Plot/data.html')


The library generates an interactive EDA report that can be
exported as an HTML file, as shown in Fig. 5, and opened in a web
browser. This is a comprehensive report presenting all information
about the features in the data frame. Besides the comprehensive
report, the library also allows specific parts of the report to be
extracted. This can be achieved by selecting "Missing Values" in the
menu at the top of the report page. For instance, the percentage of
missing values for each variable is illustrated in Fig. 6. Once an
overview of the data had been obtained, the next step was dealing
with missing values. Our strategy was to remove the rows and columns
with more than 50% missing values. Figure 6 shows that
there were no columns with more than 50% missing values.


[Figure 5: DataPrep report overview with "Dataset Statistics", "Dataset Insights", and "Variables" panels.]

Fig. 5 EDA report for clinical data. The report shows that the dataset consists of 29 features (22 categorical
and 7 numerical features) and 2506 rows. There are no duplicate rows, and 10,088 missing values account for
13.9% of the data. Besides, the report also reveals insights for each column (top-right panel), such as the
number of missing values, skewness, unique number of values, and statistical summary. The distribution of
each variable and information about missing values are also provided in the final report (bottom panel)


[Figure 6: stacked bar chart of present and missing values for each column; the vertical axis is the row count.]

Fig. 6 Missing values chart. The plot shows the missing values information by column. In the stacked column
chart, the orange section represents the number of blank rows, whereas the blue represents the non-blank
ones. As shown in the chart, no columns have more than 50% of the missing values

We removed the rows in which more than 50% of the values were
blank by running the lines of code below. The output is reported in
the first two rows of Fig. 7.


# Deal with missing values
# Identify columns with more than 50% missing values (none in this dataset)
cols_mv_50 = data.columns[data.isnull().mean() > 0.5]
print('Number of columns having more 50% missing data:', len(cols_mv_50))

# Remove rows with more than 50% missing values:
# min_count is the minimum number of non-missing values a row must have
percent = 50
min_count = int(((100 - percent) / 100) * data.shape[1] + 1)
data = data.dropna(axis=0, thresh=min_count)
print('After removing rows with more than 50% missing values:', data.shape)
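The threshold arithmetic above can be checked by hand. The following is a minimal sketch in plain Python, assuming rows are lists in which None marks a missing value; the helper names min_non_missing and drop_sparse_rows and the toy rows are illustrative and not part of the original pipeline.

```python
# Pure-Python sketch of the row filter; None marks a missing value.
def min_non_missing(n_columns, percent=50):
    """Smallest number of non-missing values a row needs in order to be kept."""
    return int(((100 - percent) / 100) * n_columns + 1)


def drop_sparse_rows(rows, percent=50):
    """Keep only the rows with at least min_non_missing non-missing values."""
    if not rows:
        return []
    threshold = min_non_missing(len(rows[0]), percent)
    return [row for row in rows if sum(v is not None for v in row) >= threshold]


# With the 29 clinical columns, a row needs at least 15 non-missing values.
print(min_non_missing(29))

# Hypothetical 4-column rows (threshold is 3 non-missing values).
toy_rows = [[1, 2, None, 4], [None, None, None, 4], [1, None, None, 4]]
print(drop_sparse_rows(toy_rows))
```

Note that with an even number of columns this formula also drops rows in which exactly 50% of the values are missing (the third toy row). With the 29 columns used here, an exact 50% split cannot occur.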


Machine Learning Methods for Survival Analysis of Breast Cancer 345 


Number of columns having more 50% missing data: 0 
After removing rows with more than 50% missing values: (1977, 29) 
List columns having missing data: Index(['LYMPH_NODES_EXAMINED_POSITIVE', 
'CELLULARITY', 'ER_IHC', 'THREEGENE',
'LATERALITY', 'HISTOLOGICAL_SUBTYPE', 'BREAST_SURGERY', 'GRADE',
'TUMOR_SIZE', 'TUMOR_STAGE'], 
dtype='object') 
After preprocessing, missing value number: 0 


Fig. 7 Output of dealing with missing values steps. There are no columns that missed more than 50% of the 
values. After removing the rows with more than 50% of the missing values, the data’s remaining rows are 
1977. Also, 10 columns, namely LYMPH_NODES_EXAMINED_POSITIVE, CELLULARITY, ER_IHC, THREEGENE, 
LATERALITY, HISTOLOGICAL_SUBTYPE, GRADE, TUMOR_SIZE, TUMOR_STAGE, BREAST_SURGERY, contain 
missing values, which are replaced by their mode (if categorical) or their average (if numeric). Once the 
preprocessing steps are completed, no missing values are found in the dataset 


After removing the rows with more than 50% of the values
missing, we replaced the remaining missing values with the mode
(for categorical variables) or the average (for numeric variables) of
the corresponding column. To achieve this, we first identified which
columns contained blanks and classified them as either categorical
or continuous numeric. Once the replacement was completed, we
counted the missing values again to ensure that none were left in
the data. The output of the code below is displayed in Fig. 7.


# Print the names of the columns that contain blanks
cols_missvalue = data.columns[data.isnull().sum() > 0]
print('List columns having missing data:', cols_missvalue)

# Categorical and numeric columns that contain missing values
cat_var = ['LYMPH_NODES_EXAMINED_POSITIVE', 'CELLULARITY', 'ER_IHC',
           'THREEGENE', 'LATERALITY', 'HISTOLOGICAL_SUBTYPE',
           'BREAST_SURGERY', 'GRADE', 'TUMOR_STAGE']
num_var = ['TUMOR_SIZE']

# Replace missing values with the most frequent value of each column
data[cat_var] = data[cat_var].fillna(data[cat_var].mode().iloc[0])

# Replace missing values with the average value of each column
data[num_var] = data[num_var].fillna(data[num_var].mean())

# Check missing values again
print('Missing value number:', data.isna().sum().sum())
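To make the two replacement rules concrete, the sketch below applies them to small lists in plain Python. The function impute and the toy values are hypothetical and only illustrate the idea; they are not the pandas calls used in the pipeline.

```python
from statistics import fmean, mode


def impute(values, strategy):
    """Fill None entries with the mode or the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mode(observed) if strategy == "mode" else fmean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical toy columns: one categorical, one numeric.
grade = ["2", "3", None, "3"]
tumor_size = [10.0, None, 30.0]

grade_filled = impute(grade, "mode")      # missing grade becomes "3"
size_filled = impute(tumor_size, "mean")  # missing size becomes 20.0
print(grade_filled, size_filled)
```

When several values are tied for the most frequent, statistics.mode returns the first one encountered, whereas pandas sorts the tied modes before `.iloc[0]` picks one, so the two approaches can differ on ties.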




Fig. 8 Features distribution in clinical data. Excluding PATIENT_ID, the remaining 28 features, including
OS_MONTHS and OS_STATUS, are plotted. The figure shows that LYMPH_NODES_EXAMINED_POSITIVE and
OS_MONTHS are right-skewed, while the CELLULARITY values, i.e., the amount of tumor cells, are mostly
high, followed by moderate and low status


Before moving into the next step of the pipeline, distribution
visuals for each variable were plotted using the line of code below.
The output is reported in Fig. 8.


# Exploring clean data (all columns except the first one, PATIENT_ID)
plot(data.iloc[:, 1:]).save('Plot/preprocessed_data.html')


In the following step (step 2 in Fig. 1), the categorical
variables were encoded as numbers so that the data could be
processed and analyzed. First, we prepared a list of
features/columns to be encoded. Then the OrdinalEncoder function
was used to transform the columns in the list into numeric values.




# Encode categorical data
# Encode OS status as a binary indicator (1 = deceased, 0 = otherwise)
data['OS_STATUS'] = np.where(data['OS_STATUS'] == '1:DECEASED', 1, 0)

# Columns that are left as they are (not passed to the encoder)
other_var = ['LYMPH_NODES_EXAMINED_POSITIVE', 'NPI', 'AGE_AT_DIAGNOSIS',
             'COHORT', 'GRADE', 'TUMOR_SIZE', 'TUMOR_STAGE',
             'TMB_NONSYNONYMOUS', 'OS_MONTHS', 'OS_STATUS', 'PATIENT_ID']
df_encode = data.drop(other_var, axis=1)

# Some variables' values are not in order, so we have to specify the
# variables and their corresponding orders
modified_list = ['CELLULARITY', 'HER2_SNP6', 'INFERRED_MENOPAUSAL_STATE',
                 'INTCLUST', 'THREEGENE']
keep_list = df_encode.columns[~df_encode.columns.isin(modified_list)]
cel_cat = ['Low', 'Moderate', 'High']
her2_cat = ['UNDEF', 'LOSS', 'NEUTRAL', 'GAIN']
inf_cat = ['Pre', 'Post']
# Category order as reported in Fig. 9
intclust_cat = ['1', '2', '3', '4ER+', '4ER-', '5', '6', '7', '8', '9', '10']
three_gene_cat = ['ER-/HER2-', 'HER2+', 'ER+/HER2- Low Prolif',
                  'ER+/HER2- High Prolif']

# Encode the predefined order variables
enc = OrdinalEncoder(categories=[cel_cat, her2_cat, inf_cat,
                                 intclust_cat, three_gene_cat]).fit(df_encode[modified_list])
encoder = enc.transform(df_encode[modified_list])
df_encode_new = pd.DataFrame(encoder, columns=modified_list)

# Encode the other variables (categories are ordered automatically)
enc1 = OrdinalEncoder().fit(df_encode[keep_list])
encoder1 = enc1.transform(df_encode[keep_list])
df_encode_new1 = pd.DataFrame(encoder1, columns=keep_list)


Finally, the encoded columns were concatenated with the
remaining columns to generate the final data frame.


# Merge encoded data and original data
df = pd.concat([df_encode_new, df_encode_new1,
                data[other_var].reset_index(drop=True)], axis=1)
print(df.shape)


To check the mapping between the encoded categories and the
original ones, the code below can be executed.


# To check the encoded categories
for i in range(len(modified_list)):
    print(modified_list[i], enc.categories_[i])
for i in range(len(keep_list)):
    print(keep_list[i], enc1.categories_[i])
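The reason for passing an explicit category order can be seen in a small plain-Python sketch. It assumes, as OrdinalEncoder does when no order is given, that categories are otherwise sorted alphabetically; the helper ordinal_encode and the toy column are illustrative only.

```python
def ordinal_encode(values, categories):
    """Map each value to the position of its category in the given order."""
    index = {category: float(i) for i, category in enumerate(categories)}
    return [index[v] for v in values]


# Hypothetical toy column.
cellularity = ["High", "Low", "Moderate", "High"]

# Explicit clinical order: Low < Moderate < High.
explicit = ordinal_encode(cellularity, ["Low", "Moderate", "High"])

# Alphabetical order (High < Low < Moderate) scrambles the clinical ranking.
alphabetical = ordinal_encode(cellularity, sorted(set(cellularity)))

print(explicit)
print(alphabetical)
```

With the explicit order, larger codes mean more tumor cells. With the alphabetical order, "High" would receive the smallest code, which would distort any model that treats the codes as ordered.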




CELLULARITY ['Low' 'Moderate' 'High'] 
HER2_SNP6 ['UNDEF' 'LOSS' 'NEUTRAL' 'GAIN'] 
INFERRED_MENOPAUSAL_STATE ['Pre' 'Post'] 
INTCLUST ['1' '2' '3' '4ER+' '4ER-' '5' '6' '7' '8' '9' '10']
THREEGENE ['ER-/HER2-' 'HER2+' 'ER+/HER2- Low Prolif' 'ER+/HER2- High Prolif'] 
CHEMOTHERAPY ['NO' 'YES'] 
ER_IHC ['Negative' 'Positve'] 
HORMONE_THERAPY ['NO' 'YES'] 
CLAUDIN_SUBTYPE ['Basal' 'Her2' 'LumA' 'LumB' 'NC' 'Normal' 'claudin-low'] 
LATERALITY ['Left' 'Right'] 
RADIO_THERAPY ['NO' 'YES'] 
HISTOLOGICAL_SUBTYPE ['Ductal/NST' 'Lobular' 'Medullary' 'Metaplastic' 'Mixed'
'Mucinous'
'Other' 'Tubular/ cribriform']
BREAST_SURGERY ['BREAST CONSERVING' 'MASTECTOMY']
CANCER_TYPE_DETAILED ['Breast' 'Breast Invasive Ductal Carcinoma'
'Breast Invasive Lobular Carcinoma'
'Breast Invasive Mixed Mucinous Carcinoma'
'Breast Mixed Ductal and Lobular Carcinoma' 'Invasive Breast Carcinoma'
'Metaplastic Breast Cancer'] 
ER_STATUS ['Negative' 'Positive'] 
HER2_STATUS ['Negative' 'Positive'] 
ONCOTREE_CODE ['BRCA' 'BREAST' 'IDC' 'ILC' 'IMMC' 'MBC' 'MDLC'] 
PR_STATUS ['Negative' 'Positive'] 


Fig. 9 Output of the mapping between the encoded and original categories. The figure shows the original 
values for each encoded categorical column. The original categories are presented in ascending order based 
on their corresponding encoded values 


Next, the clean data was saved in a CSV file, clinical.csv, in the
Data folder to be used for the following analysis and to be
integrated with transcriptomic data (Fig. 9).


# Save preprocessed data to CSV to merge with gene data
df.to_csv('Data/clinical.csv', index=False)


Correlation analysis was performed to understand the
relationship between features in the data. The heat map in Fig. 10
presents the Pearson correlation matrix, where the varying intensity
of color represents the values of correlation. Some highly correlated
features were observed in the data, such as ER_STATUS and
ER_IHC, and AGE_AT_DIAGNOSIS and
INFERRED_MENOPAUSAL_STATE.


# Drop Patient ID column as this is not relevant for the analysis
df = df.drop(['PATIENT_ID'], axis=1)

# Correlation analysis
colormap = plt.cm.Reds
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), linewidths=0.1, vmax=0.8,
            square=True, cmap=colormap, linecolor='white')
plt.title('Correlation matrix', fontsize=14)
plt.show()
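Each cell of the heat map is a Pearson correlation coefficient between two columns. As a rough illustration of what is computed for one pair, the sketch below implements the coefficient in plain Python; the function pearson and the toy indicator values are assumptions made for the example and are not taken from the dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical encoded values for two related binary features.
er_status = [0, 0, 1, 1, 1, 1]
er_ihc = [0, 0, 1, 1, 1, 0]

r = pearson(er_status, er_ihc)
print(round(r, 3))
```

The coefficient is symmetric in its two arguments, equals 1 for a feature compared with itself, and is undefined (division by zero) for a constant column.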




Fig. 10 Correlation matrix of clinical features. The correlation matrix depicts the linear correlation between all
the pairs of attributes and ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation),
with the value of zero being no correlation between the features. Color density represents the correlation
values, where the darker color implies higher values and the lighter color implies lower ones. The figure shows
some highly correlated features in the data, such as ER_STATUS and ER_IHC; AGE_AT_DIAGNOSIS and
INFERRED_MENOPAUSAL_STATE


Since the next steps of the pipeline are based on survival
analysis, we calculated the percentage of censored data using the
lines of code below. Overall, 42.2% of the records were censored.


# Censored records are those without an observed event (OS_STATUS == 0)
num_censored = df.shape[0] - df["OS_STATUS"].sum()
print("%.1f%% of records are censored" % (num_censored / df.shape[0] * 100))
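The reported percentage can be cross-checked against the counts given in the Cox model report (Fig. 13), which lists 1977 observations and 1143 observed events. The short sketch below repeats the arithmetic with these two numbers; the variable names are illustrative.

```python
# Counts taken from the model report in Fig. 13
n_records = 1977  # number of observations
n_events = 1143   # number of events (deaths) observed

# Every record without an observed event is censored
n_censored = n_records - n_events
pct_censored = 100 * n_censored / n_records

print("%d censored records" % n_censored)
print("%.1f%% of records are censored" % pct_censored)
```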


Then, the follow-up time distribution of death and censored
patients was plotted using the code below. The final chart is shown
in Fig. 11. This step allows a further investigation into the
time-to-event distribution for censored/non-censored patients.


Fig. 11 Distribution of follow-up times of censored and observed (death) events. 42.2% of the total
observations were censored. The distribution is right-skewed and is different between censored patients
and those who experienced the event. The censored group has more patients with longer survival times


# Time Distribution of Death and Censor
plt.figure(figsize=(9, 6))
val, bins, patches = plt.hist((df.query('OS_STATUS == 1')['OS_MONTHS'],
                               df.query('OS_STATUS == 0')['OS_MONTHS']),
                              bins=30, stacked=True)
_ = plt.legend(patches, ["Time of Deaths", "Time of Censored"])
plt.title("Time Distribution of Censored and Death Patients")


3.4.3 Plot Cox Proportional Hazards Model

In the next step of our pipeline (step 5 in Fig. 1), the CPH model
was fitted on the clinical data. The results were then visualized and
reported to view the coefficients and ranges of features. Before
running the analysis, data needed to be normalized. Min-max
normalization, one of the most popular methods to normalize
data, was applied. The method is based on the formula in Eq. 15,
and the transformed data values range between 0 and 1.




$$x_{\mathrm{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)} \qquad (15)$$

# Cox survival analysis
# Normalise data (all columns except the survival status and time)
ss = MinMaxScaler()
df_norm = df.drop(['OS_STATUS', 'OS_MONTHS'], axis=1)
df_norm = pd.DataFrame(ss.fit_transform(df_norm), columns=df_norm.columns)

# Add the unscaled survival status and time back to the normalised data
df_norm['OS_STATUS'] = df['OS_STATUS']
df_norm['OS_MONTHS'] = df['OS_MONTHS']
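As a worked instance of Eq. 15, the sketch below scales a short list of values in plain Python. The function min_max_scale and the toy ages are illustrative; returning zeros for a constant column is a choice made for this sketch to avoid dividing by zero.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1] following Eq. 15."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Constant column: Eq. 15 would divide by zero, so return zeros.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical ages at diagnosis.
ages = [30.0, 45.0, 60.0, 90.0]
scaled = min_max_scale(ages)
print(scaled)  # the minimum maps to 0 and the maximum maps to 1
```

After this transformation, a one-unit change in a scaled feature corresponds to moving from the smallest to the largest observed value, which matters when the Cox coefficients are interpreted.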


The next step was to use the entire dataset to fit the Cox 
regression model, and the final results were plotted using the 
code below. 


# Build model
# Cox proportional hazards model
cph = CoxPHFitter()
cph.fit(df_norm, duration_col='OS_MONTHS', event_col='OS_STATUS')

# Plot the coefficients (log hazard ratios) with their confidence intervals
plt.figure(figsize=(9, 12))
plt.title('Cox Proportional Hazards Model for Clinical data')
cph.plot()

# Report
cph.print_summary(columns=["coef", "exp(coef)", "exp(coef) lower 95%",
                           "exp(coef) upper 95%", "z", "p"], decimals=3)


The hazard ratio of each feature and its statistical report are
presented in Figs. 12 and 13, respectively. According to Fig. 12,
AGE_AT_DIAGNOSIS was the factor most strongly associated with
the death events, with a coefficient (log hazard ratio) of 3.753,
which corresponds to a hazard ratio of 42.632 (Fig. 13). Because the
features were min-max normalized, this ratio compares a patient at
the maximum age in the dataset with one at the minimum age, not
patients one year apart. LYMPH_NODES_EXAMINED_POSITIVE was
the second critical factor among the clinical data, with a coefficient
of 1.888 and a hazard ratio of 6.607, so patients with more positive
lymph nodes had a higher risk of death. As shown in Fig. 13, the
overall C-index of this model is 0.685, which indicates an
acceptable predictive model.
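The coefficients plotted in Fig. 12 are log hazard ratios, whereas the exp(coef) column of Fig. 13 holds the hazard ratios themselves. The sketch below shows the conversion for the three largest coefficients reported in Fig. 13; because the published coefficients are rounded to three decimals, the results agree with the reported hazard ratios only approximately.

```python
from math import exp

# Coefficients (log hazard ratios) as reported in Fig. 13
coefficients = {
    "AGE_AT_DIAGNOSIS": 3.753,
    "LYMPH_NODES_EXAMINED_POSITIVE": 1.888,
    "TUMOR_SIZE": 1.189,
}

# The hazard ratio is the exponential of the coefficient
hazard_ratios = {name: exp(coef) for name, coef in coefficients.items()}

for name, hr in hazard_ratios.items():
    print(f"{name}: HR = {hr:.2f}")
```

Since the features were min-max scaled before fitting, each hazard ratio refers to a change across the full observed range of the feature.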

The advantage of fitting the entire dataset to a regression model 
is that more data is fitted to the CPH model, which usually 
increases the accuracy of the model. Besides, the predictive capabil- 
ities of the CPH model fitted can be evaluated to see how well the 
algorithm performs on the entire data. However, the generalization 
of the model cannot be assessed if the entire data is fitted to the 
model, and it is usually considered less trustworthy. Hence, cross- 
validation can be performed to reduce selection bias and overfit- 
ting. This approach also provides more insight into how well the 



Fig. 12 Results of the Cox proportional hazards model for clinical data. The log-hazard ratio is plotted for all
the features, with a 95% confidence interval (CI). AGE_AT_DIAGNOSIS, LYMPH_NODES_EXAMINED_POSITIVE,
and TUMOR_SIZE were found as the top three most significant factors associated with the death events with
the log(HR) values of 3.753, 1.888, and 1.189, respectively. In other words, patients having higher values of
these three predictors are more likely to have lower survival times. In contrast, the predictors with log(HR)
values below zero, such as HISTOLOGICAL_SUBTYPE and INFERRED_MENOPAUSAL_STATE, were negatively
associated with the death event. Patients with higher values of these factors tend to live longer compared to
those who have low values


model will perform on unseen data. Therefore, the next step of the
analysis was to conduct a fivefold cross-validation to get an average
C-index and generate more robust prediction scores. Specifically, a
fivefold cross-validation approach splits the data into five folds,
four of which are used as a training set to fit the model. The fitted
model is then evaluated on the left-out fold and a C-index is
computed. The process is repeated until each of the five folds has
served once as the testing set. The final C-index is calculated as



model                         lifelines.CoxPHFitter
duration col                  'OS_MONTHS'
event col                     'OS_STATUS'
baseline estimation           breslow
number of observations        1977
number of events observed     1143
partial log-likelihood        -7603.988
time fit was run              2021-12-23 12:16:40 UTC

                                  coef  exp(coef)  exp(coef) lower 95%  exp(coef) upper 95%        z         p
CELLULARITY                     -0.107      0.899                0.749                1.079   -1.147     0.251
HER2_SNP6                        0.298      1.347                0.853                2.129    1.277     0.201
INFERRED_MENOPAUSAL_STATE       -0.481      0.618                0.488                0.783   -3.999   <0.0005
INTCLUST                        -0.027      0.973                0.794                1.193   -0.264     0.792
THREEGENE                        0.428      1.535                1.189                1.981    3.286     0.001
CHEMOTHERAPY                     0.324      1.383                1.106                1.728    2.845     0.004
ER_IHC                           0.014      1.014                0.766                1.343    0.099     0.921
HORMONE_THERAPY                 -0.061      0.940                0.810                1.091   -0.809     0.418
CLAUDIN_SUBTYPE                 -0.029      0.971                0.771                1.224   -0.246     0.806
LATERALITY                      -0.106      0.899                0.799                1.012   -1.765     0.078
RADIO_THERAPY                   -0.179      0.836                0.716                0.975   -2.280     0.023
HISTOLOGICAL_SUBTYPE            -0.571      0.565                0.324                0.984   -2.015     0.044
BREAST_SURGERY                   0.087      1.091                0.935                1.272    1.103     0.270
CANCER_TYPE_DETAILED            -0.060      0.942                0.547                1.624   -0.215     0.830
ER_STATUS                       -0.355      0.701                0.527                0.933   -2.432     0.015
HER2_STATUS                      0.216      1.241                0.983                1.565    1.819     0.069
ONCOTREE_CODE                    0.641      1.898                1.099                3.276    2.300     0.021
PR_STATUS                       -0.071      0.932                0.809                1.073   -0.983     0.326
LYMPH_NODES_EXAMINED_POSITIVE    1.888      6.607                3.451               12.648    5.699   <0.0005
NPI                              0.597      1.817                1.162                2.841    2.618     0.009
AGE_AT_DIAGNOSIS                 3.753     42.632               24.263               74.906   13.049   <0.0005
COHORT                           0.109      1.115                0.901                1.380    1.003     0.316
GRADE                            0.117      1.124                0.878                1.439    0.929     0.353
TUMOR_SIZE                       1.189      3.285                1.717                6.286    3.592   <0.0005
TUMOR_STAGE                      0.657      1.928                1.127                3.301    2.395     0.017
TMB_NONSYNONYMOUS                0.008      1.008                0.331                3.068    0.014     0.989

Concordance                   0.685
Partial AIC                   15259.976
log-likelihood ratio test     497.155 on 26 df
-log2(p) of ll-ratio test     291.895


Fig. 13 Cox proportional hazards report for clinical data. The report indicates that OS_MONTHS was the 
duration variable, while OS_STATUS was the event variable used for survival analysis. The figure also reports 




the average of the five C-index values generated during the fivefold 
cross-validation process. The code below can be run to perform 
cross-validation and generate the average C-index (Fig. 14). 


# Cross validation (optional)
scores = k_fold_cross_validation(cph, df_norm, 'OS_MONTHS',
                                 event_col='OS_STATUS', k=5,
                                 scoring_method="concordance_index", seed=18)
print("Average score", round(np.mean(scores), 3))
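The fold mechanics described above can be sketched in plain Python. This is an illustration of the general fivefold scheme, not the internal implementation of the lifelines helper; the function k_fold_indices is hypothetical, and the fold scores at the end are invented placeholders used only to show the averaging step.

```python
import random


def k_fold_indices(n_samples, k=5, seed=18):
    """Return k (train, test) index pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


splits = k_fold_indices(1977)  # 1977 patients remain after preprocessing
for train, test in splits:
    print(len(train), len(test))

# The final score is the plain average of the k fold scores
# (placeholder values, not results from the chapter).
fold_scores = [0.66, 0.67, 0.68, 0.69, 0.70]
average_score = sum(fold_scores) / len(fold_scores)
```

Each model is therefore trained on roughly 80% of the patients and scored on the remaining 20%, five times over.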


3.4.4 Set Up and Evaluate Machine Learning Algorithms

In order to run the ML algorithms (step 6 in Fig. 1), the following
steps were applied:


1. Data was split into training and testing sets using a stratified 
split with a ratio of 80:20. 


2. The machine learning models were trained using a fivefold 
cross-validation approach on the training set. Grid search was 
applied to autotune hyperparameters to get optimal solutions. 


3. The trained models were applied to the testing set to generate 
patient-specific predictions. 

4. Steps 1-3 were repeated 20 times on different splits of training
and testing sets to obtain an average C-index. This process
provides a more robust evaluation of the models since it is
not dependent on a single training-testing split.


5. The prediction scores generated by the models were used to 
separate the patients in the testing set into higher risk and lower 
risk to investigate any significant difference in the survival rates 
of the two groups. 


The five steps presented above are further discussed and illu- 
strated below. First, we set up a seed value to ensure the reproduc- 
ibility of the results. Then, the data was arranged into a data frame 
X containing the prognostic attributes and a y data frame 


Average score 0.677 


Fig. 14 Average C-index of fivefold cross-validation for Cox proportional hazards 
models. The figure shows the average C-index generated during the fivefold 
cross-validation process. The final C-index was 0.677, which was lower than the 
C-index of 0.685 reported in Fig. 13 


< 
Fig. 13 (continued) the HR values (exp(coef)), with the corresponding 95% confidence intervals, and p-values of
the clinical features. The predictive accuracy of the CPH model, i.e., the C-index, was 0.685, which indicates
an acceptable model. Similar to the results presented in Fig. 12, AGE_AT_DIAGNOSIS,
LYMPH_NODES_EXAMINED_POSITIVE, and TUMOR_SIZE were identified as the top three most significant factors
associated with the death event with a p-value less than 0.0005, and coefficient/log(HR) values of 3.753,
1.888, and 1.189, respectively




containing the target variables (survival time and status). The new 
data frames were split into training and testing sets using a random 
and stratified approach with a ratio of 80:20. 


# Set up seed and the options for the cross-validation approach
SEED = 5
CV = KFold(n_splits=5, shuffle=True, random_state=0)

# Split data to prepare for ML
X = df.drop(['OS_MONTHS', 'OS_STATUS'], axis=1)
# Convert the event indicator to boolean
df['OS_STATUS'] = np.where(df['OS_STATUS'] == 1, True, False)
# Structured array holding the event indicator and the survival time
y = df[['OS_STATUS', 'OS_MONTHS']].to_records(index=False)

# Split the data set into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y['OS_STATUS'], random_state=SEED)
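Stratifying on OS_STATUS keeps the share of censored patients the same in the training and testing sets. The plain-Python sketch below illustrates the idea on a toy status vector; stratified_split is a hypothetical helper written for this example, and scikit-learn's own allocation of samples may differ in detail.

```python
import random


def stratified_split(labels, test_size=0.2, seed=5):
    """Split indices so that each class keeps its share in the test set."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


# Hypothetical status vector: 60 deaths (True) and 40 censored (False).
status = [True] * 60 + [False] * 40
train_idx, test_idx = stratified_split(status)

deaths_in_test = sum(status[i] for i in test_idx)
print(len(train_idx), len(test_idx), deaths_in_test)
```

In this toy case the testing set holds 20 patients, 12 of whom died, which reproduces the 60:40 ratio of the full vector.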


Once data was prepared, the ML models were applied by
defining a function that wraps the training and evaluation
procedure. We used grid search with fivefold cross-validation to
train and tune the hyperparameters for each estimator. Then, we
applied the optimal model to generate the final predictions on the
testing set. The function prints the C-index on the testing set and
returns the optimal model and its predictions.


# Build model
# Define a function for grid search to tune the training model
# and predict the results
def grid_search(estimator, param, X_train, y_train, X_test, y_test, CV):
    # Define grid search and fit it on the training set
    gcv = GridSearchCV(estimator, param_grid=param, cv=CV,
                       n_jobs=-1).fit(X_train, y_train)

    # Find best model
    model = gcv.best_estimator_
    print(model)

    # Generate predictions (risk scores) for the testing set
    prediction = model.predict(X_test)
    result = concordance_index_censored(y_test["OS_STATUS"],
                                        y_test["OS_MONTHS"], prediction)
    print('C-index for test set (Hold out):', result[0])

    return [model, prediction]
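The C-index used for evaluation measures how often the model ranks a pair of patients correctly: the patient who died earlier should have the higher predicted risk. The sketch below is a simplified plain-Python version of Harrell's C-index, assuming that pairs with identical times are simply skipped; the library function handles such ties more carefully, so this is an illustration rather than a replacement.

```python
def concordance_index(event, time, risk):
    """Simplified Harrell's C-index (pairs with equal times are skipped)."""
    concordant = tied = comparable = 0
    for i in range(len(time)):
        if not event[i]:
            continue  # a pair is comparable only if the earlier time is a death
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    tied += 1
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return (concordant + 0.5 * tied) / comparable


# Hypothetical toy data: 4 patients, the third one is censored.
event = [True, True, False, True]
time = [5, 10, 12, 20]
risk = [0.9, 0.6, 0.7, 0.1]
c = concordance_index(event, time, risk)
print(c)
```

In this toy example five pairs are comparable and four of them are ranked correctly. A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.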


Next, to avoid bias in our final evaluation, we ran each ML
model 20 times. By defining the function below, the number of
re-runs can be easily changed by modifying the value of n. The
function randomly generates n different seeds, one for each
iteration. In each loop, the data is split into training and testing
sets, the model is fitted on the training set, and the fitted model is
evaluated on the unseen testing set. By doing so, we randomly
created n different testing sets and evaluated each algorithm
n times. Finally, the average result of the n runs was calculated
and reported.


# Re-run experiment 20 times
def c_index(model, X, y, n=20):
    # Draw n distinct seeds, one for each train/test split
    np.random.seed(1)
    seeds = np.random.permutation(1000)[:n]

    # Train and evaluate the model n times
    cindex_score = []
    predict_list = []

    for s in seeds:
        X_trn, X_test, y_trn, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y['OS_STATUS'], random_state=s)
        model.fit(X_trn, y_trn)
        prediction = model.predict(X_test)
        predict_list.append(prediction)
        result = concordance_index_censored(y_test["OS_STATUS"],
                                            y_test["OS_MONTHS"], prediction)
        cindex_score.append(round(result[0], 3))

    print('Average C-index for {} runs'.format(n), np.mean(cindex_score))

    return [cindex_score, predict_list]


After defining the two functions above for the ML process, we
designed the experiment pipeline by specifying the algorithms and
establishing their hyperparameters. Before applying the algorithms,
all the data had to be normalized using min-max normalization.
Different values of the ridge regression penalty were tested to tune
the CPH model (the values varied between 0.001 and 100, as shown
in Table 1).


# Define the Pipeline and hyperparameter
# CoxPHSurvivalAnalysis
pipe_cox = Pipeline([('scaler', MinMaxScaler()),
                     ('model', CoxPHSurvivalAnalysis())])
param_cox = {'scaler': [MinMaxScaler()],
             'model__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}


Table 1
Hyperparameters of the models. Each method was parametrized and trained using a fivefold
cross-validation approach. Grid search was used with different hyperparameters while maximizing the
C-index

Models  Hyperparameter name          Hyperparameter set               Selected value
CPH     Ridge regression parameter   [0.001, 0.01, 0.1, 1, 10, 100]   1
RSF     max_features                 sqrt                             sqrt
        max_depth                    8                                8
        min_samples_leaf             [50, 100]                        50
        min_samples_split            100                              100
        n_estimators                 500                              500
GBS     learning_rate                [0.01, 0.1, 1]                   0.1
        n_estimators                 [200, 500, 800, 1000]            200
SSVM    optimizer                    [avltree, rbtree, simple]        avltree
        max_iter                     [500, 5000]                      500


Then, we set up the hyperparameters for the three ML-based
algorithms, namely RSF, GBS, and SSVM. In the RSF algorithm,
the n_estimators and max_depth hyperparameters specify the
number of trees in the forest and the maximum depth of each tree,
while the min_samples_leaf and min_samples_split parameters
specify the minimum number of samples required at a leaf node
and the minimum number of samples required to split an internal
node, respectively. The deeper the trees grow, the more complex
the model, which can easily lead to overfitting and increased
computational cost. To avoid these problems, a predefined
max_depth parameter can be set; otherwise, the trees are grown
until each leaf contains fewer than min_samples_split samples. The
max_features hyperparameter sets the number of features to
consider when looking for the best split. By default, the algorithm
considers all the features and selects the one with the optimal
metric to perform the split. If max_features is set to sqrt, the
maximum number of features considered at each split equals the
square root of the total number of features in the dataset.
Reducing the number of features can save computational
resources, increase the stability of the forest, and reduce variance
and overfitting.
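For the clinical data, the effect of the sqrt setting can be worked out by hand. The sketch below assumes that the library truncates the square root to an integer, as scikit-learn's tree ensembles do; the count of 26 predictors follows from the 29 original columns minus PATIENT_ID, OS_MONTHS, and OS_STATUS.

```python
from math import sqrt

# 29 columns minus PATIENT_ID, OS_MONTHS and OS_STATUS
n_features = 29 - 3

# Assumption: the square root is truncated to an integer, with a minimum of 1
features_per_split = max(1, int(sqrt(n_features)))

print(n_features, features_per_split)
```

Under this assumption, each split in a tree would be chosen among about 5 of the 26 clinical features rather than among all of them.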




# Random Survival Forests
pipe_rsf = Pipeline([('scaler', MinMaxScaler()),
                     ('model', RandomSurvivalForest())])
param_rsf = {'scaler': [MinMaxScaler()],
             'model__random_state': [SEED],
             'model__max_features': ['sqrt'],
             'model__max_depth': [8],
             'model__min_samples_leaf': [50, 100],
             'model__min_samples_split': [100],
             'model__n_estimators': [500]}


In the GBS algorithm, the n_estimators parameter sets the
number of trees to generate, while the learning_rate parameter
shrinks the contribution of each tree. The GBS model is robust to
overfitting, so a higher value of the n_estimators parameter often
results in better performance. However, there is a trade-off
between n_estimators and learning_rate. Thus, different
combinations of values for these two hyperparameters were tried
in the tuning phase.
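To give a sense of how many combinations the tuning phase covers, the sketch below enumerates the GBS grid listed in Table 1 with the standard library. It only illustrates how a grid expands; it does not call the grid search used in the pipeline, and the variable names are illustrative.

```python
from itertools import product

# GBS grid from Table 1
gbs_grid = {
    "learning_rate": [0.01, 0.1, 1],
    "n_estimators": [200, 500, 800, 1000],
}

# Every combination of one value per hyperparameter
names = list(gbs_grid)
candidates = [dict(zip(names, combo)) for combo in product(*gbs_grid.values())]

n_folds = 5  # fivefold cross-validation
n_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

The cost of grid search grows multiplicatively with every hyperparameter added, which is one reason to keep each list of candidate values short.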


# Gradient Boost Survival
pipe_gbs = Pipeline([('scaler', MinMaxScaler()),
                     ('model', GradientBoostingSurvivalAnalysis())])
param_gbs = {'scaler': [MinMaxScaler()],
             'model__random_state': [SEED],
             'model__learning_rate': [0.01, 0.1, 1],
             'model__n_estimators': [200, 500, 800, 1000]}


The hyperparameters defined for the SSVM algorithm included
the optimizer, which selects the optimization technique: the AVL
tree (avltree), the red-black tree (rbtree), or the simple method.
The max_iter parameter defines the maximum number of iterations
to perform in the Newton optimization. These hyperparameters are
necessary to design an effective and efficient SSVM model. A
summary of the hyperparameters tuned for each model using grid
search and their final values is presented in Table 1.

# Survival SVM
pipe_svm = Pipeline([('scaler', MinMaxScaler()),
                     ('model', FastSurvivalSVM())])
param_svm = {'scaler': [MinMaxScaler()],
             'model__random_state': [SEED],
             'model__max_iter': [500, 5000],
             'model__optimizer': ['avltree', 'rbtree', 'simple']}




Once data preparation and model design were completed, a
dictionary of estimators, pairing the name of each algorithm with
its pipeline and hyperparameter grid, was generated to support
looping the same procedure over each model.


# Estimator list:
estimator_list = {'Cox Regression': [pipe_cox, param_cox],
                  'Random Forest Survival': [pipe_rsf, param_rsf],
                  'Gradient Boosting Survival': [pipe_gbs, param_gbs],
                  'SVM Survival': [pipe_svm, param_svm]}


Since the training and testing phases for each algorithm follow
the same approach, we collected the estimators in a dictionary and
iterated the same procedure over it. The output of the procedure
displays the optimal algorithm, the holdout-test results, and the
average C-index for each model, as shown in Fig. 15. The results
show that the average C-index over 20 runs of each of the three
ML-based models was higher than that of CPH, a well-known
statistical approach for survival analysis. SSVM had the highest
average C-index with a value of 0.688, followed by RSF, GBS, and
CPH with C-indices of 0.685, 0.683, and 0.678, respectively.


model_list = []
pred_list = []
c_index_list = []
pred_list_n = []

for model_name, index in estimator_list.items():
    print('\n', model_name)

    # Tune the model with grid search and evaluate it on the holdout set
    estimator = index[0]
    param = index[1]
    outcome = grid_search(estimator, param, X_train, y_train,
                          X_test, y_test, CV)
    model = outcome[0]
    pred_list.append(outcome[1])

    # Run model n times to check C-index
    score, pre = c_index(model, X, y, n=20)
    c_index_list.append(score)
    pred_list_n.append(pre)


Boxplots were then used to visualize and compare the distribu- 
tions of C-index values for the 20 runs for each model (Fig. 16). On 
average, SSVM had the highest performance, followed by RSF, 
GBS, and CPH. 




Cox Regression 
Pipeline(steps=[('scaler', MinMaxScaler()), 
('model', CoxPHSurvivalAnalysis(alpha=0.1))]) 
C-index for test set (Hold out): 0.660715999616086 
Average C-index for 20 runs 0.6778500000000001 


Random Forest Survival 
Pipeline(steps=[('scaler', MinMaxScaler()), 
('model', 

RandomSurvivalForest (max_depth=8, max_features='sqrt', 
min_samples_leaf=50, 
min_samples_split=100, n_estimators=500, 
random_state=5))]) 

C-index for test set (Hold out): 0.6678184086764565 
Average C-index for 20 runs 0.6854499999999998 


Gradient Boosting Survival 
Pipeline(steps=[('scaler', MinMaxScaler()), 
('model', 
GradientBoostingSurvivalAnalysis(n_estimators=200, 
random_state=5))]) 
C-index for test set (Hold out): 0.667875995776946 
Average C-index for 20 runs 0.6825 


SVM Survival 
Pipeline(steps=[('scaler', MinMaxScaler()), 
('model', 
FastSurvivalSVM(max_iter=500, optimizer='avltree', 
random_state=5) )]) 
C-index for test set (Hold out): 0.6747288607351953 
Average C-index for 20 runs 0.68755 


Fig. 15 CPH and ML models’ results for clinical data. The selected hyperparameters, initial test result, and the 
average C-index of each model are displayed in the outcome. Overall, the average performance over 20 runs 
of the three ML-based models outperformed the CPH model, a well-known statistical approach for survival 
analysis. SSVM had the highest average C-index with a value of 0.688, followed by RSF, GBS, and CPH 


# Visualise results
name = ['CPH', 'RSF', 'GBS', 'SSVM']
cv_res = []

# Collect one (model name, C-index) row per run
for i in range(0, 4):
    for c in c_index_list[i]:
        cv_res.append([name[i], c])

c_plot = pd.DataFrame(cv_res, columns=['Model Name', 'C-index'])
ax = sns.boxplot(x="Model Name", y="C-index", data=c_plot)
plt.title('C-index for 20 runs')


Machine Learning Methods for Survival Analysis of Breast Cancer 361 


[Figure 16 plot: boxplots titled "C-index for 20 runs"; x-axis: Model Name (CPH, RSF, GBS, SSVM); y-axis: C-index, ranging from about 0.65 to 0.71]


Fig. 16 C-index comparisons for Experiment 1. Boxplots of C-index results of clinical data using CPH, RSF, 
GBS, and SSVM. The experiments were replicated 20 times. In each experiment, the data was randomly 
divided into training and testing sets with a ratio of 80:20 while guaranteeing the same censoring percentage 
on each subset of data. SSVM was found to have the highest median C-index, followed by RSF, GBS, and CPH 


The patients in the testing set were then ranked by their pre- 
dicted risk score and split into two equal-sized groups using the 
median risk score. High-risk groups included patients with prog- 
nostic risk scores greater than or equal to the median value, while 
low-risk groups included those with prognostic risk scores below 
the median value. 
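The grouping rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name and the risk scores are hypothetical, and patients whose score equals the median are assigned to the high-risk group, as in the rule stated in the text.

```python
from statistics import median


def split_by_median(risk_scores):
    """Label each patient 'HR' (score >= median) or 'LR' (score < median)."""
    med = median(risk_scores)
    return ["HR" if score >= med else "LR" for score in risk_scores]


# Hypothetical predicted risk scores for six patients in a test set
toy_scores = [0.2, 1.4, -0.3, 0.9, 0.5, -1.1]
toy_groups = split_by_median(toy_scores)
print(list(zip(toy_scores, toy_groups)))
```

With an even number of patients and no ties at the median, the two groups have the same size; with ties or an odd number of patients, the high-risk group can be slightly larger.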

In the next step of the pipeline (step 6 in Fig. 1), Kaplan–Meier
plots and log-rank tests were produced for all the models to
statistically investigate the differences between the survival curves
of the two groups. Figure 17 reveals that the lower-risk patients,
i.e., those with lower predicted risk scores, had better survival
outcomes (i.e., higher survival probability). Moreover, the log-rank
test showed statistically significant differences in the survival
distributions of high-risk and low-risk patients for all four models
(p-values < 0.0001). This analysis shows that the clinical factors
can be used to split the patients into risk groups based on their
predicted scores. GBS gave the strongest separation between the two
groups, with the smallest log-rank p-value (5.918E-12).
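For readers who want to see what the log-rank test computes, the sketch below implements the standard two-group statistic in plain Python. It is an illustration only and not the lifelines implementation used in this tutorial; the function name and the toy data are assumptions. The p-value uses the fact that, for one degree of freedom, the upper tail of the chi-square distribution equals erfc(sqrt(x / 2)).

```python
import math


def logrank_sketch(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Patients still under observation just before time t
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        # Events occurring exactly at time t
        d_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        d_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square, 1 degree of freedom
    return chi2, p_value


# Toy example (hypothetical): group A dies early, group B dies late
chi2, p = logrank_sketch([1, 2, 3, 4, 5], [1, 1, 1, 1, 1],
                         [6, 7, 8, 9, 10], [1, 1, 1, 1, 1])
print(chi2, p)
```

The statistic compares, at every event time, the number of events observed in one group with the number expected if both groups shared the same hazard, so identical groups give a statistic of zero and a p-value of one.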




[Figure 17 plots: four Kaplan–Meier panels titled "KM Curves CPH", "KM Curves RSF", "KM Curves GBS", and "KM Curves SSVM". Each panel shows the survival curves of the high-risk (HR) and low-risk (LR) groups against Time (months, 0 to 300), with the numbers of patients at risk, censored, and with events listed below the x-axis (198 patients at risk per group at time 0). Log-rank p-values shown in the panels: 1.903e-10 (CPH), 5.918e-12 (GBS), and 9.473e-12 and 2.961e-10 for the RSF and SSVM panels.]


Fig. 17 Kaplan–Meier curves comparing the high-risk and low-risk breast cancer groups, stratified by the
predicted survival risk scores generated by the four models. The high-risk group (n = 198) included patients
with predicted risk scores greater than or equal to the median value, while the low-risk group (n = 198)
comprised those below the median value. The p-value from the log-rank test was calculated to determine the
statistical significance of the difference in survival functions between the two groups. The figure shows
statistically significant differences in survival distributions between the two groups for all four models,
with p-values lower than 0.0001


fig, ax = plt.subplots(2, 2, figsize=(12, 12))
k = 0

for pred in pred_list:
    df1 = X_test.reset_index(drop=True)
    risk = []
    y_pred = pred
    med = np.median(y_pred)

    # High risk (1) if the predicted score is >= the median, else low risk (0)
    r = np.where(y_pred >= med, 1, 0)
    df1['Risk'] = r
    print(df1.shape)
    ix = df1['Risk'] == 1

    df_y = pd.DataFrame(y_test)
    df_y['OS_STATUS'] = np.where(df_y['OS_STATUS'] == True, 1, 0)
    df1['OS_STATUS'] = df_y['OS_STATUS']
    df1['OS_MONTHS'] = df_y['OS_MONTHS']
    T_hr, E_hr = df1.loc[ix]['OS_MONTHS'], df1.loc[ix]['OS_STATUS']
    T_lr, E_lr = df1.loc[~ix]['OS_MONTHS'], df1.loc[~ix]['OS_STATUS']

    # Set-up plots
    k += 1
    plt.subplot(2, 2, k)

    # Fit survival curves
    kmf_hr = KaplanMeierFitter()
    ax = kmf_hr.fit(T_hr, E_hr, label='HR').plot_survival_function()
    kmf_lr = KaplanMeierFitter()
    ax = kmf_lr.fit(T_lr, E_lr, label='LR').plot_survival_function()
    # Show at-risk counts for both groups (HR and LR)
    add_at_risk_counts(kmf_hr, kmf_lr)

    # Format graph
    plt.ylim(0, 1)
    ax.set_xlabel('Timeline (months)', fontsize='large')
    ax.set_ylabel('Percentage of Population Alive', fontsize='large')

    # Calculate p-value
    res = logrank_test(T_hr, T_lr, event_observed_A=E_hr,
                       event_observed_B=E_lr, alpha=.95)
    print('\nModel', name[k-1])
    res.print_summary()

    # Locate the label at the 1st out of 9 tick marks
    xloc = max(np.max(T_hr), np.max(T_lr)) / 10
    ax.text(xloc, .2, 'p-value = {:.3e}'.format(res.p_value), fontsize=15)
    ax.set_title('KM Curves {}'.format(name[k-1]))

plt.tight_layout()
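The curves drawn by the plotting code are Kaplan–Meier estimates. As a minimal sketch of what such an estimator computes, the plain-Python function below applies the product-limit formula to a handful of made-up observations. It is an illustration only, not the lifelines implementation; the function name and the data are assumptions.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: returns [(t, S(t))] at each distinct event time."""
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if e and x == t)
        # Probability of surviving past t, given survival up to t
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up times (months) and event flags (1 = death, 0 = censored)
toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(toy_times, toy_events):
    print(t, round(s, 3))
```

Censored patients never trigger a drop in the curve, but they reduce the number at risk at later event times, which is why the at-risk and censored counts are reported below each panel.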


3.4.5 Interpret Model

Clinicians can rely on a predictive model when its outcome can be
interpreted. This is especially crucial for the healthcare domain,
where every decision relates to human life. Interpretability of ML
can be defined as the extent to which an individual can understand
the cause of the predicted outcome [72]. SHapley Additive
exPlanations (SHAP) values [47] can be applied to interpret the
results of the ML models run in the previous section. SHAP values
represent a unified approach to interpreting the predictions made by
complex ML algorithms. This explainable approach has gained
much attention from researchers, and it has been increasingly
applied in many fields, including medical and oncology applications
[Se #2 | 

As shown in step 7 in Fig. 1, SHAP values can be used to
measure the importance of the features by calculating the impact of
each feature on the model prediction. In other words, they explain
the weight of a specific feature in the prediction made for each
patient. Each patient is represented by one data point, with positive
or negative values indicating the direction of the impact. The higher
the SHAP value associated with the patient, the higher the mortality
risk. For example, for the age feature, a 20-year-old patient might
have a negative SHAP value of −1.5, meaning that age lowers the
predicted risk for this young patient (i.e., a better prognosis). In
contrast, a 70-year-old patient might have a positive SHAP value of
1.0, indicating that this patient faces a higher mortality risk.
Hence, age, in this case, is an important feature significantly
influencing the survival rate of the patient. The code below can be
run to perform the SHAP interpretability analysis. The run time for
CPH, GBS, and SSVM is about 30 min per model, while RSF requires
about 16 h to generate the SHAP plot.


# Initialize JS for plot
shap.initjs()

for i in range(0, 4):
    print('\nModel', name[i])
    m = model_list[i][1]
    m.fit(X_train, y_train)
    explainer = shap.Explainer(m.predict, X_train,
                               feature_names=X_train.columns)
    shaps = explainer(X_test)
    shap.summary_plot(shaps, X_test)
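To give an intuition for what a SHAP value is, the sketch below computes exact Shapley values for a tiny hypothetical risk model by enumerating all feature coalitions. It is an illustration of the underlying definition only: the model, its coefficients, and the patient values are made up, and a single baseline patient stands in for the background dataset that the shap library averages over.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                # Features in the coalition take the patient's value,
                # all other features are set to the baseline value.
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_risk_model(features):
    """Hypothetical linear risk score: higher means higher mortality risk."""
    age, tumour_size = features
    return 0.04 * age + 0.5 * tumour_size


patient = [70, 2.0]   # hypothetical patient: age 70, tumour size 2.0
baseline = [50, 1.0]  # hypothetical reference patient
print(shapley_values(toy_risk_model, patient, baseline))
```

The contributions add up to the difference between the prediction for the patient and the prediction for the baseline, which is the additivity property that makes SHAP values comparable across features. Exhaustive enumeration grows exponentially with the number of features, so it is only practical for toy examples.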


Figure 18 shows the SHAP summary plot for clinical data,
where a single patient is represented by one data point for each
feature. The x-axis represents the effect of the features on the
prediction of the algorithm for a specific observation in the testing
set, while the y-axis reports the top prognostic predictors in
descending order based on their importance ranking.
AGE_AT_DIAGNOSIS was found consistently among the four models as
the top significant factor impacting the survival risk. Specifically,
higher age is associated with higher mortality risk. The SHAP




[Figure 18 plots: SHAP summary plots in four panels, (a) CPH, (b) RSF, (c) GBS, and (d) SSVM]


Fig. 18 SHAP summary plot for clinical data for (a) CPH, (b) RSF, (c) GBS, and (d) SSVM models. For each
clinical feature, a single patient is represented by one point. The y-axis lists the top prognostic features and
presents them in descending order based on their importance ranking provided by the mean of their absolute
SHAP values. The x-axis reports the SHAP value indicating the impact of the feature on the prediction of the
algorithm for a specific observation in the testing set. The color represents the value of the feature. The higher
the SHAP value the patient had, the higher the risk of death or the shorter survival time. AGE_AT_DIAGNOSIS
was found consistently among the four models as the top significant factor impacting the survival risk


values and feature rankings differ slightly across the models.
For example, according to CPH and SSVM,
INFERRED_MENOPAUSAL_STATE was the second most important feature
associated with survival risk, while NPI was the second most
important feature in RSF and GBS. In contrast, LATERALITY
and ER_STATUS had the lowest impact on the outcome of the
model, as their data points converged around a SHAP value of zero.
Hence, we can get a holistic picture of the model prediction from the
SHAP plots, as they illustrate the importance of features and their
corresponding impact on the outcome while showing the value
distribution of those features in the test set.
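The importance ranking on the y-axis of a SHAP summary plot is based on the mean of the absolute SHAP values of each feature. A minimal sketch of that ranking step is given below. The feature names are taken from the clinical data, but the SHAP values are invented for the example and the function name is an assumption.

```python
def rank_by_mean_abs_shap(shap_matrix, feature_names):
    """Rank features by the mean absolute SHAP value across patients."""
    n_patients = len(shap_matrix)
    importance = {
        name: sum(abs(row[k]) for row in shap_matrix) / n_patients
        for k, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Rows are patients, columns are features (hypothetical SHAP values)
toy_features = ["AGE_AT_DIAGNOSIS", "NPI", "LATERALITY"]
toy_shap = [
    [0.9, -0.3, 0.01],
    [-1.2, 0.4, -0.02],
    [0.6, -0.5, 0.00],
]
for feature, score in rank_by_mean_abs_shap(toy_shap, toy_features):
    print(feature, round(score, 3))
```

A feature whose SHAP values are all close to zero, such as the hypothetical LATERALITY column here, ends up at the bottom of the ranking regardless of the sign of its individual contributions.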




3.5 Experiment 2: Transcriptomic Data

The second experiment presented in our work consists of conducting
survival analysis on transcriptomic data, following steps similar to
those presented in Subheading 3.4. Since transcriptomic data is
high-dimensional, feature extraction is highly recommended to
avoid overfitting and save computational time and resources. Fifty
features were extracted from the transcriptomic data and used to train
and evaluate the models. To quickly reproduce the experiment and
for a more straightforward presentation, we divided this experiment
into two notebooks. The first one includes the preprocessing
and feature selection steps, where the number of extracted features
can be easily changed. The training and evaluation of the models
are performed in the second notebook, where the data extracted
from the first notebook are explored and used for the ML models.

3.5.1 Load Data

First, transcriptomic data was loaded. As it omitted OS_MONTHS
and OS_STATUS, clinical data was also loaded into a data frame to
extract the relevant information about survival time and status.

# Load data
file1 = pd.read_csv('Data/data_clinical_patient.csv')
file2 = pd.read_csv('Data/data_mRNA_median_all_sample_Zscores.csv')


Then, the first five rows and data information are displayed, as 
shown in Fig. 19. 


# Have a quick look at the data
file2.info()
file2.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24368 entries, 0 to 24367
Columns: 1906 entries, Hugo_Symbol to MB-4313
dtypes: float64(1905), object(1)
memory usage: 354.4+ MB

   Hugo_Symbol  Entrez_Gene_Id  MB-0362  MB-0346  MB-0386  MB-0574  MB-0503
0  RERE                  473.0  -0.7082   1.2179   0.0168  -0.4248   0.4916
1  RNF165             494470.0  -0.4419   0.4140  -0.6843  -1.1139  -0.6875
2  CD049690                NaN   0.2236   0.2255   0.5691   0.3545   0.7865
3  BC033982                NaN  -2.1485   0.4763  -0.2446   0.2618  -0.2695
4  PHF7                51533.0  -0.3220  -1.0921   0.2830  -0.2864   0.0772


Fig. 19 Output of transcriptomic data information. The figure presents an overview of the transcriptomic data
frame, including the total number of entries and the number of columns. There are 24,368 genomic entries
and 1906 columns in the data (two gene identifier columns and 1904 patient columns). As the dataset contains
too many columns to display, the output shows only the first 7 columns and the first five rows




The transcriptomic data included 1906 columns and 24,368 
rows. The first two columns report the gene identifiers in different 
formats, namely Hugo_Symbol and Entrez_Gene_Id, while the rest 
of the columns report the data for the 1904 patients. 


3.5.2 Preprocess Data

Hugo Symbols (Hugo_Symbol column) were used in the transcriptomic
data as a stable identifier for genes. Therefore, the
Entrez_Gene_Id column was removed from the data frame.


# Drop unused column
file2 = file2.drop('Entrez_Gene_Id', axis=1)


We kept only the rows with non-blank values in the Hugo_Symbol column.
Next, missing and duplicate values were checked and removed
(step 3 in Fig. 1). A different approach for dealing with duplicates
in transcriptomic data consists of replacing all the duplicates for
each gene with their average value. For the sake of simplicity, in this
tutorial, we decided to simply remove the duplicate values.


# Drop NA in GeneID
file2 = file2[file2['Hugo_Symbol'].notna()]

# Check null in GeneID columns
file2['Hugo_Symbol'].isnull().sum()

# Check duplicate values
print('The number of duplicate values of Hugo_Symbol in data:',
      file2['Hugo_Symbol'].duplicated().sum())

# Drop duplicate values for Gene ID
file2 = file2.drop_duplicates(subset=['Hugo_Symbol'])
print('After pre-processing, the number of duplicate values of Hugo_Symbol:',
      file2['Hugo_Symbol'].duplicated().sum())
print('Shape of Gene data:', file2.shape)
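The alternative approach mentioned above, in which duplicated gene symbols are replaced by their average expression, could be sketched with pandas as follows. This is not the method used in this tutorial; the small data frame, its gene names, and its patient columns are hypothetical and only serve to show the grouping step.

```python
import pandas as pd

# Hypothetical expression table with one duplicated gene symbol
expr = pd.DataFrame({
    "Hugo_Symbol": ["GENE_A", "GENE_B", "GENE_A"],
    "MB-0001": [1.0, 0.5, 3.0],
    "MB-0002": [-2.0, 0.1, 0.0],
})

# Average the expression values of rows sharing the same gene symbol,
# keeping the genes in their order of first appearance
averaged = expr.groupby("Hugo_Symbol", as_index=False, sort=False).mean()
print(averaged)
```

Averaging keeps information from every duplicated row, whereas dropping duplicates retains only the first occurrence; which option is preferable depends on why the duplicates exist in the source data.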


Figure 20 shows that initially there were 192 repeated
Hugo_Symbol values in our data. After preprocessing, there were no
duplicates, and the final data frame had 24,176 rows (i.e., Hugo
symbols) and 1905 columns (i.e., 1904 patients and one Hugo
symbol ID). After eliminating duplicate values, the data frame was
transposed to allow the matching of the Patient IDs in the
transcriptomic data with those in the clinical data. As shown in
Fig. 20, the new shape of the data frame was 1904 rows (i.e.,
patients) and 24,177 columns (i.e., 24,176 Hugo symbols and
one Patient ID column). The first three rows in the new data
frame were extracted to check the format of the transposed matrix.




The number of duplicate values of Hugo_Symbol in data: 192
After pre-processing, the number of duplicate values of Hugo_Symbol in data: 0
Shape of Gene data: (24176, 1905)
New shape of Gene data: (1904, 24177)

   PATIENT_ID     RERE   RNF165  CD049690  BC033982     PHF7    CIDEA    PAPD4  AI082173  SLC17A3
0  MB-0362     -0.7082  -0.4419    0.2236   -2.1485  -0.3220   0.0543  -0.7462   -0.4045   0.7777
1  MB-0346      1.2179   0.4140    0.2255    0.4763  -1.0921  -1.1534   0.0709    0.5118  -0.5187
2  MB-0386      0.0168  -0.6843    0.5691   -0.2446   0.2830   2.9594  -0.6240   -0.3849   0.6866

3 rows x 24177 columns


Fig. 20 Output of transcriptomic data preprocessing. The dataset includes 192 duplicate Hugo symbols. After
removing duplicate values in the Hugo_Symbol column, there are 24,176 rows (i.e., Hugo symbols) and 1905
columns (i.e., 1904 patients and one Hugo symbol column) in the dataset. We transposed the data frame
before merging it with the clinical data to retrieve the OS_MONTHS and OS_STATUS columns, needed for
performing survival analysis. After transposing the data, the new table contained 1904 patient rows and
24,177 columns (24,176 Hugo symbols and one patient ID column). The first three rows of the table are
shown in the figure


# Transpose so that patients become rows, to match the clinical data
file2 = (file2.set_index('Hugo_Symbol').T
              .rename_axis('PATIENT_ID')
              .rename_axis(None, axis=1)
              .reset_index())
print('New shape of Gene data:', file2.shape)
file2.head(3)


The new data frame was then merged with the OS_MONTHS
and OS_STATUS columns from the clinical data based on the Patient
ID information. The resulting data frame comprised only the patients
present in both the transcriptomic and clinical tables.


# Merge gene data with OS time and status
data = pd.merge(file1[['PATIENT_ID', 'OS_MONTHS', 'OS_STATUS']], file2,
                how="inner", on=["PATIENT_ID"])

# Have a quick look at data
data.head()


In the next step (step 3 in Fig. 1), we checked if the new data 
frame contained any missing values. 


# Check missing values
print('Total missing value in the dataset:', data.isnull().sum().sum())

cols_missvalue = data.columns[data.isnull().sum() > 0]
print('List columns having missing data:', cols_missvalue)


According to the output presented in Fig. 21, there were 
10 missing values in the entire data. We replaced those with their 




Total missing values in the dataset: 10
List columns having missing data: Index(['TMPRSS7', 'SLC25A19', 'IDO1',
       'CSNK2A1', 'BAMBI', 'MRPL24', 'AK127905',
       'FAM71A'],
      dtype='object')
After preprocessing, the number of missing values: 0


Fig. 21 Output of the processing of missing values. There are 10 missing values in the entire data. The Hugo ID
columns with missing values are TMPRSS7, SLC25A19, IDO1, CSNK2A1, BAMBI, MRPL24, AK127905, and
FAM71A. As the missing values in the data are numeric, we replace them with their average values in the
corresponding columns


average values in the corresponding columns. Several techniques
have been proposed to handle missing values in transcriptomic
data, such as k-nearest neighbors imputation, Gaussian mixture
clustering imputation, and weighted least squares imputation
[74–76]; column-mean imputation was chosen in this tutorial for
simplicity.


# Deal with missing values
# Replace missing values with average values
data[cols_missvalue] = data[cols_missvalue].fillna(
    data[cols_missvalue].mean())

# Check missing values again
print('After preprocessing, the number of missing values:',
      data.isna().sum().sum())
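As a point of comparison with mean imputation, the sketch below shows the idea behind k-nearest neighbors imputation in plain Python: a missing entry is filled with the average of that column over the k most similar rows. This is a simplified illustration written for this purpose and not the specific methods of the cited works; the function name, the distance definition, and the toy matrix are assumptions.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the column mean over the k nearest rows."""
    def distance(a, b):
        # Root mean squared difference over the columns observed in both rows
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is not None:
                continue
            donors = [(distance(row, other), other[col])
                      for j, other in enumerate(rows)
                      if j != i and other[col] is not None]
            if not donors:
                raise ValueError("no row has an observed value in column %d" % col)
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            imputed[i][col] = sum(nearest) / len(nearest)
    return imputed


# Hypothetical expression matrix: rows are patients, None marks a missing value
toy_matrix = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 5.0],
    [9.0, 9.0, 50.0],
]
print(knn_impute(toy_matrix, k=2)[0])
```

In this toy matrix the missing value is filled from the two patients with similar profiles rather than from the outlying fourth patient, which a plain column mean would include.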


3.5.3 Feature Selection

mRMR was applied to extract the most relevant features (genes, as
identified in the Hugo_Symbol column) to be used for the ML models
(step 4 in Fig. 1). Before employing mRMR, it is recommended to
normalize the data to boost the performance of the algorithm and save
computational time. Hence, after removing the survival and patient
ID information, min–max normalization was implemented to
normalize the transcriptomic data.


# Normalise data
ss = MinMaxScaler()

X_norm = data.drop(['OS_STATUS', 'OS_MONTHS', 'PATIENT_ID'], axis=1)
X_norm = pd.DataFrame(ss.fit_transform(X_norm), columns=X_norm.columns)
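Min–max normalization rescales each feature to the range [0, 1] using x' = (x − min) / (max − min), computed column by column. The plain-Python sketch below illustrates the formula on three expression values of the gene RERE shown in Fig. 19. The function name is an assumption, and returning zeros for a constant column is a choice made for this example.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


# Expression values of RERE for the first three patients in Fig. 19
rere = [-0.7082, 1.2179, 0.0168]
print(min_max_scale(rere))
```

The smallest value maps to 0, the largest to 1, and the ordering of the remaining values is preserved, so the scaling changes the range of each gene but not the ranking of patients within it.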


For the mRMR algorithm, the number of selected features can
be easily changed by modifying the value of K in the code below. In
this experiment, we extracted 50 features (K = 50) to demonstrate
how to run the pipeline. The more features are selected, the longer
the mRMR algorithm takes to run. For 50 features, the algorithm
took around 30–45 min to run. After the features were
extracted, the new data frame was saved to a new CSV file
(GeneID_MRMR_50.csv); this file will be required for the ML process in
the second notebook.


# Feature extraction
# Select features using mRMR
y_mrmr = data['OS_MONTHS']
features_selected = mrmr_classif(X_norm, y_mrmr, K=50)
X_mrmr = data[features_selected]

# Save to CSV file (copy first to avoid modifying a view of data)
df_mrmr = X_mrmr.copy()
df_mrmr['PATIENT_ID'] = data['PATIENT_ID']
df_mrmr.to_csv('Data/GeneID_MRMR_50.csv', index=False)
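The idea behind mRMR (maximum relevance, minimum redundancy) can be illustrated with a small greedy selection written in plain Python. This sketch is a simplification under stated assumptions: it scores both relevance and redundancy with the absolute Pearson correlation and combines them as a quotient, which is not necessarily the scoring used internally by mrmr_classif. The function names and the toy features are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def mrmr_sketch(features, target, k):
    """Greedy selection: maximise relevance / redundancy at every step."""
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected = []
    remaining = list(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(abs(pearson(features[name], features[s]))
                         for s in selected) / len(selected)
        return relevance[name] / max(redundancy, 1e-3)  # avoid division by zero

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical data: g2 is almost a copy of g1, g3 carries independent signal
toy_target = [3, 3, 1, 1, -1, -1, -3, -3]
toy_features = {
    "g1": [1, 1, 1, 1, -1, -1, -1, -1],
    "g2": [1.1, 0.9, 1.1, 0.9, -0.9, -1.1, -0.9, -1.1],
    "g3": [1, 1, -1, -1, 1, 1, -1, -1],
}
print(mrmr_sketch(toy_features, toy_target, k=2))
```

Although g2 is more strongly correlated with the target than g3, it is penalised for being nearly identical to the already selected g1, so the second pick is g3. Each additional feature requires a new round of redundancy calculations against the features already selected, which is why the run time grows with K.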


For easier processing, a new Jupyter notebook was created, and
the extracted data was loaded to carry out the next steps of the
analysis. Then, the transcriptomic data of the 50 extracted features
was merged with the clinical data by PATIENT_ID. After merging, the
PATIENT_ID column was no longer relevant for the ML analysis, and it
was removed from the data frame. Before analyzing the data, the
OS_STATUS column was encoded as numeric values. The final data
frame included 1904 rows (patients) and 52 columns (the survival
time and status of patients, and 50 genes), as shown in Fig. 22.


# Load data
file1 = pd.read_csv('Data/data_clinical_patient.csv')
file2 = pd.read_csv('Data/GeneID_MRMR_50.csv')

# Merge gene data with OS time and status
data = pd.merge(file1[['PATIENT_ID', 'OS_MONTHS', 'OS_STATUS']], file2,
                how="inner", on=["PATIENT_ID"])

# Preprocess data
# Drop unused columns
drop_list = ['PATIENT_ID']
df = data.drop(drop_list, axis=1)
print('After the first preprocessing, the shape of data is', df.shape)

# Encode OS status as 1 (deceased) or 0 (censored)
df['OS_STATUS'] = np.where(df['OS_STATUS'] == '1:DECEASED', 1, 0)


After cleaning, the shape of data is (1904, 52) 
Missing value number: 0 


Fig. 22 Output of preprocessing step. The figure shows that there were 1904 
rows and 52 columns in the final data frame. The columns in the final dataset 
comprised OS_MONTHS, OS_STATUS, and 50 transcriptomic extracted feature 
columns. No missing values were found in the data 


[Figure 23 plot: heatmap titled "Correlation matrix" of OS_MONTHS and the selected gene expression features (e.g., FGF3, ATP9A, KPRP, AICDA), with a color bar whose upper limit is 0.8]


Fig. 23 Correlation matrix of 50 gene expression features. The correlation matrix depicts the linear correlation
between all the pairs of attributes and ranges from −1 (perfect negative correlation) to +1 (perfect positive
correlation), with the value of zero representing no correlation between the features. Color density represents
the values of the correlation, where a darker color implies higher values and a lighter color implies lower
ones. There were no highly correlated features observed in the data


Once the preprocessing steps (step 3 in Fig. 1) on the data were
completed, we conducted a correlation analysis and plotted the
distribution of follow-up times to investigate the data, as
displayed in Figs. 23 and 24. No highly correlated features were
found in the selected transcriptomic data.




[Figure 24 plot: stacked histogram titled "Time Distribution for Censored and Observed Events"; x-axis: Time (months), 0 to 350; legend: Observed Events (Death), Censored Events]


Fig. 24 Distribution of follow-up times of censored and observed (death) events associated with the
transcriptomic data selected by mRMR. In the data, 42.1% of the observations were censored. The
distribution is right-skewed, and it differs between censored patients and those who experienced the
event. The censored group has more patients with longer survival times


# Correlation analysis
colormap = plt.cm.Reds
plt.figure(figsize=(8, 8))
sns.heatmap(df.corr(), linewidths=0.1, vmax=0.8,
            square=True, cmap=colormap, linecolor='white')
plt.title('Correlation matrix', fontsize=14)
plt.show()

# Time distribution of death and censored events
num_censored = df.shape[0] - df["OS_STATUS"].sum()
print("%.1f%% of records are censored" % (num_censored / df.shape[0] * 100))

plt.figure(figsize=(10, 6))
val, bins, patches = plt.hist((df.query('OS_STATUS == 1')['OS_MONTHS'],
                               df.query('OS_STATUS == 0')['OS_MONTHS']),
                              bins=30, stacked=True)
_ = plt.legend(patches, ["Time of Death", "Time of Censored"])
plt.title("Time Distribution of Censored and Death Patients")


The rest of Experiment 2, including plotting the CPH model,
preparing and evaluating the ML algorithms, and interpreting the
results, was set up as previously done in Experiment 1 in
Subheadings 3.4.3, 3.4.4, and 3.4.5. Since the code to run the analysis
in the sections below is the same as the one shown in those
subheadings, we do not repeat it here. However, the
complete notebook can be accessed at
https://github.com/Angione-Lab/survival_analysis_tutorial.


3.5.4 Plot Cox Proportional Hazards Model

Before fitting the CPH model (step 5 in Fig. 1), the data was
normalized using the min–max method (as shown in Subheading
3.4.3). Then, the log(HR) values were plotted as shown in
Fig. 25, and the statistical report, including HR with a 95%
confidence interval and log-rank p-values, was generated. Genes
LCN15, OTOS, and INSM2 were identified as the top three most
significant factors associated with a high probability of experiencing
the event of interest (i.e., death), while genes MATN1 and KPRP
were negatively associated with the death event (as shown by their
negative log(HR) values). As shown in Fig. 26, the overall C-index
of this model is 0.574, which indicates an acceptable predictive model
(a C-index of 0.5 corresponds to random ranking).


3.5.5 Set Up and Evaluate Machine Learning Algorithms

After visualizing the results of the CPH model, we performed the
same steps as in Subheading 3.4.4 to build and evaluate the ML
models (step 6 in Fig. 1). The following steps were applied:


1. Data was split into training (80%) and testing sets (20%).

2. Data was normalized using min–max normalization. Normalized
data was used to fit and train the four predictive algorithms.

3. Fivefold cross-validation with grid search was used for tuning
the hyperparameters and selecting the optimal hyperparameters.

4. The models were evaluated on the testing set, and the full
process (steps 1–3) was repeated 20 times to obtain the average
C-index.

5. Finally, patients in the testing set were ranked in descending
order based on their predicted risk scores and split into two
groups according to the median values. The comparisons
between the two groups (high-risk and low-risk groups) for
all the four algorithms were performed using Kaplan–Meier
curves and log-rank test.
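Step 1 of the list above relies on a split that keeps the same proportion of censored patients in the training and testing sets. A minimal plain-Python sketch of such a stratified 80:20 split is shown below. It illustrates the idea only and is not the function used in the notebooks; the function name, the seed, and the toy event vector are assumptions.

```python
import random


def stratified_split(events, test_fraction=0.2, seed=0):
    """Split patient indices so both sets keep the same censoring percentage."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for status in (0, 1):  # 0 = censored, 1 = event observed
        idx = [i for i, e in enumerate(events) if e == status]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical cohort: 40 censored patients and 60 observed events
toy_events = [0] * 40 + [1] * 60
train, test = stratified_split(toy_events)

censored_train = sum(1 for i in train if toy_events[i] == 0) / len(train)
censored_test = sum(1 for i in test if toy_events[i] == 0) / len(test)
print(len(train), len(test), censored_train, censored_test)
```

Repeating this split with different seeds, then refitting and re-evaluating the models each time, gives the distribution of C-index values summarized by the boxplots.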


The code for setting up and evaluating the ML models is
the same as in Subheading 3.4.4. The outcomes of the four
algorithms are presented in Figs. 27 and 28. RSF had the highest
average C-index (0.53), followed by GBS, SSVM, and CPH.




[Figure 25 plot: forest plot titled "Cox Proportional Hazards Model for 50 Gene Features"; x-axis: log(HR) (95% CI); y-axis: the gene features, with LCN15, OTOS, and INSM2 listed first]


Fig. 25 Results of the CPH model applied to transcriptomic data. The genes LCN15, OTOS, and INSM2 were
found as the top three most significant factors associated with low survival, with log(HR) values of 1.643,
1.556, and 1.507, respectively. Hence, patients having higher values of these three predictors are more likely
to have a shorter survival time. In contrast, the predictors with log(HR) values below zero (i.e., HR less than
one), such as MATN1 and KPRP, were negatively associated with the death event. Patients with higher values
of these genes tend to live longer compared to those who have lower expression values of the same genes
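The relationship between the coefficients (log(HR)) and the hazard ratios reported for the CPH model is HR = exp(coef), and a confidence interval for HR is obtained by exponentiating the interval for the coefficient. The short sketch below illustrates this with the coefficients of OTOS (1.556) and KPRP (−1.241) from the CPH report. The function name is an assumption, and the standard error used in the last call is a made-up value for illustration only.

```python
import math


def hazard_ratio(coef, se=None, z=1.96):
    """Convert a Cox coefficient into a hazard ratio and, optionally, a 95% CI."""
    hr = math.exp(coef)
    if se is None:
        return hr, None
    return hr, (math.exp(coef - z * se), math.exp(coef + z * se))


# Coefficients taken from the CPH report for transcriptomic data
print(round(hazard_ratio(1.556)[0], 2))    # OTOS: HR above one, higher risk
print(round(hazard_ratio(-1.241)[0], 3))   # KPRP: HR below one, lower risk

# Hypothetical standard error, for illustration only
print(hazard_ratio(1.556, se=0.6))
```

A positive coefficient therefore always corresponds to a hazard ratio above one, and a negative coefficient to a hazard ratio below one, which is why the sign of log(HR) is enough to read the direction of the association from the plot.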


The Kaplan–Meier curves for breast cancer patients in the
testing set according to their predicted prognostic score using
50 features revealed that only the CPH model reported a significant
difference in the survival distributions of high-risk and low-risk




[Figure 26: summary report of the lifelines CoxPHFitter model. Legible header fields: number of observations 1904; number of events observed 1103. The report table lists, for each of the 50 gene features, coef, exp(coef), the lower and upper 95% bounds of exp(coef), z, and p, followed by the concordance, partial AIC, and log-likelihood ratio test. Legible rows include OTOS (coef 1.556, exp(coef) 4.74), INSM2 (coef 1.507, exp(coef) 4.512), and KPRP (coef −1.241).]


Fig. 26 Cox proportional hazards report for transcriptomic data. The report indicates that OS_MONTHS was the
duration variable, while OS_STATUS was the event variable used for survival analysis. The figure also reports
the HR values (exp(coef)), with the corresponding 95% confidence interval, and p-values of the 50 extracted
features. The predictive accuracy of the CPH model, i.e., the C-index, was 0.574, which indicates an
acceptable model. Similar to the results presented in Fig. 25, the genes LCN15, OTOS, and INSM2 were
identified as the top three most significant factors associated with the death event with a p-value less than
0.05, and coefficient/log(HR) values of 1.643, 1.556, and 1.507, respectively


patients with a p-value of 0.009, as shown in Fig. 29. This result
might be due to the number of features selected in the preprocessing
steps. For this reason, several approaches have been proposed
to select the optimal subset of features and achieve more accurate
and robust results [77–79].


3.5.6 Interpret Model

In order to provide an interpretation of the results of the models
(step 7 in Fig. 1), we computed and plotted the SHAP values for all the
features in the test set. The SHAP values for the top 20 features are
shown in Fig. 30. A single patient is represented by each data point.
The y-axis lists the top 20 most influential genes in descending
order, represented by their Hugo symbols. The x-axis reports the
corresponding SHAP values for a specific observation in the testing




[Figure 27 plot: boxplots titled "C-index for 20 runs"; x-axis: Model Name (CPH, RSF, GBS, SSVM); y-axis: C-index, ranging from about 0.48 to 0.56]


Fig. 27 C-index comparisons for Experiment 2. Boxplots of C-index results of transcriptomic data using CPH, 
RSF, GBS, and SSVM. The experiments were replicated 20 times. In each experiment, the data was randomly 
divided into training and testing sets with a ratio of 80:20 while guaranteeing the same censoring percentage 
on each set of data. RSF was found to have the highest median C-index, followed by GBS, SSVM, and CPH 


set. The higher the SHAP value, the higher the mortality risk of the
patient represented by the data point. The color represents low or
high gene expression values. In particular, the genes LCN15 and
AA625691 were identified among the top 10 features by all four
models, and OTOS was selected by CPH, RSF, and SSVM. High values
of OTOS had a positive impact on the models' outcome (i.e.,
high values of this gene correlate with a higher risk of experiencing
the event of interest). All these genes were associated with patient
survival and could represent useful prognostic biomarkers for breast
cancer patients. Most gene features in the CPH model had SHAP
values concentrated around 0, indicating no significant influence
on the outcome of the model.


3.6 Experiment 3: Integrating Clinical to Transcriptomic Data

In the last experiment presented in our tutorial, clinical information
(from Experiment 1) and transcriptomic data (from Experiment 2)
were integrated to improve the predictive power of the ML models.
The workflow followed in this experiment is similar to the one
previously followed in Subheadings 3.4 and 3.5. First, data was
loaded and cleaned before performing EDA. Next, the CPH results
were visualized, and the reports were extracted for further analysis.




Cox Regression
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model', CoxPHSurvivalAnalysis(alpha=0.001))])
C-index for test set (Hold out): 0.5157661773365008
Average C-index for 20 runs 0.5149500000000001


Random Forest Survival
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model',
                 RandomSurvivalForest(max_depth=8, max_features='sqrt',
                                      min_samples_leaf=50,
                                      min_samples_split=100, n_estimators=500,
                                      random_state=5))])
C-index for test set (Hold out): 0.5301415199691377
Average C-index for 20 runs 0.53135


Gradient Boosting Survival
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model',
                 GradientBoostingSurvivalAnalysis(n_estimators=1000,
                                                  random_state=5))])
C-index for test set (Hold out): 0.5120302125845161
Average C-index for 20 runs 0.52285


SVM Survival
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model',
                 FastSurvivalSVM(max_iter=500, optimizer='avltree',
                                 random_state=5))])
C-index for test set (Hold out): 0.4973096992954458
Average C-index for 20 runs 0.5169000000000001


Fig. 28 ML model’s results for 50 selected genes on the transcriptomic data. The selected hyperparameters, 
initial test results, and the average C-index of each model are displayed in the output. Overall, the average 
performance over 20 runs of the three ML-based models outperformed the CPH model on the analysis of 
transcriptomic data for survival prediction. RSF had the highest average C-index with a value of 0.530, 
followed by GBS, SSVM, and CPH 


Then, we prepared the data for survival analysis and constructed the
ML models, which were trained and evaluated for their performance.
Finally, the outcomes were interpreted to identify the important
markers associated with low survival.


3.6.1 Load Data The data used for this experiment is derived from the data already
used for Experiment 1 (Subheading 3.4) and Experiment 2
(Subheading 3.5), i.e., the encoded clinical data and the 50 genes
(Hugo symbols) extracted using mRMR. There were 1977 and 1903
observations in the preprocessed clinical data and transcriptomic data,
respectively. We extracted the observations that matched between these
two datasets and used them in this experiment.




[Figure: Kaplan-Meier curves for the high-risk (HR) and low-risk (LR) groups, with at-risk, censored, and event counts below each panel. Panels: KM Curves CPH (p-value = 0.009), KM Curves RSF (p-value = 0.066), KM Curves GBS (p-value = 0.078), KM Curves SSVM (p-value = 0.346); x-axis: Time (months)]


Fig. 29 Kaplan-Meier curves to compare high-risk and low-risk breast cancer groups, stratified by predicted
survival risk score based on the transcriptomic data when using 50 features. The high-risk group includes
patients with predicted risk scores above the median value, while the low-risk group comprises patients with
predicted risk scores below the median value. The p-value from the log-rank test was calculated to
determine whether the survival functions of the two groups differ significantly. The figure shows that only
the CPH model showed a statistically significant difference between risk groups, with a p-value of 0.009
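
The stratification and test described in the caption of Fig. 29 can be sketched in plain Python. This is a minimal illustration, not the code of the tutorial notebook: the function names split_by_median and logrank_test, the toy risk scores, and the toy follow-up times are all hypothetical, and scores exactly equal to the median are assumed to fall in the low-risk group. The p-value uses the chi-square distribution with one degree of freedom.

```python
import math
from statistics import median


def split_by_median(risk_scores):
    """Label a patient 'HR' if the predicted risk is above the median, else 'LR'."""
    cutoff = median(risk_scores)
    return ["HR" if score > cutoff else "LR" for score in risk_scores]


def logrank_test(times, events, groups):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    observed = expected = variance = 0.0
    # Loop over the distinct times at which at least one event occurred
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        n = sum(1 for ti in times if ti >= t)  # patients still at risk
        n_hr = sum(1 for ti, g in zip(times, groups) if ti >= t and g == "HR")
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d_hr = sum(1 for ti, e, g in zip(times, events, groups)
                   if ti == t and e and g == "HR")
        observed += d_hr
        expected += d * n_hr / n
        if n > 1:
            variance += d * (n_hr / n) * (1 - n_hr / n) * (n - d) / (n - 1)
    stat = (observed - expected) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom
    return stat, math.erfc(math.sqrt(stat / 2))


# Toy data for illustration only (event = 1 means death, 0 means censored)
risk = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
months = [5, 8, 12, 20, 30, 45, 60, 80]
status = [1, 1, 1, 0, 1, 0, 0, 0]

groups = split_by_median(risk)
stat, p_value = logrank_test(months, status, groups)
print(groups, round(stat, 3), round(p_value, 3))
```

The sketch assumes both groups contain patients at risk at some event time; otherwise the variance is zero and the statistic is undefined.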


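# Load data
import pandas as pd

# Forward slashes avoid backslash-escape issues and work on all platforms
file1 = pd.read_csv('Data/clinical.csv')
file2 = pd.read_csv('Data/GeneID_MRMR_50.csv')

# Merge clinical and transcriptomic data, keeping only patients present in both files
data = pd.merge(file1, file2, how="inner", on=["PATIENT_ID"])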
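
As a quick sanity check on what the inner join keeps, the sketch below reproduces the matching step with plain Python sets. The patient IDs and the helper name matched_ids are made up for illustration and are not part of the tutorial notebook. Assuming PATIENT_ID is unique in both files, the merged table has one row per ID in the intersection, so it can never have more rows than the smaller input.

```python
def matched_ids(clinical_ids, transcriptomic_ids):
    """Return the sorted patient IDs present in both collections (inner-join keys)."""
    return sorted(set(clinical_ids) & set(transcriptomic_ids))


# Toy IDs for illustration only
clinical_ids = ["P-001", "P-002", "P-003", "P-004"]
transcriptomic_ids = ["P-002", "P-003", "P-005"]

common = matched_ids(clinical_ids, transcriptomic_ids)
print(common)  # ['P-002', 'P-003']
```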




[Figure: SHAP summary plots in four panels, (a) CPH, (b) RSF, (c) GBS, (d) SSVM; y-axis: top 20 genes for each model; x-axis: SHAP value (impact on model output); color: feature value from low to high]

Fig. 30 SHAP summary plot for transcriptomic data for (a) CPH, (b) RSF, (c) GBS, and (d) SSVM models. For
each gene feature, a single patient is represented by each data point. The y-axis lists the top 20 prognostic
biomarkers and presents them in descending order based on the ranking provided by the mean of their
absolute SHAP values. The x-axis reports the SHAP value indicating the impact of the feature on the prediction
of the algorithm for a specific observation in the testing set. The color represents the value of the feature for
each patient. The higher the SHAP value for a patient, the higher the risk of death. The genes OTOS and
AA625691 were found to be the most important predictors for the CPH and SSVM models, respectively, while
LCN15 was identified as the most significant feature for the RSF and GBS models




Then, an overview of the merged data frame and its first five rows
were displayed. As shown in Fig. 31, there
were 79 columns and 1903 rows in the new data frame. No missing
values were found in the final dataset.


<class 'pandas.core.frame.DataFrame'> 
Int64Index: 1903 entries, 0 to 1902 
Data columns (total 79 columns): 


 #   Column                          Non-Null Count  Dtype
 0   CELLULARITY                     1903 non-null   float64
 1   HER2_SNP6                       1903 non-null   float64
 2   INFERRED_MENOPAUSAL_STATE       1903 non-null   float64
 3   INTCLUST                        1903 non-null   float64
 4   THREEGENE                       1903 non-null   float64
 5   CHEMOTHERAPY                    1903 non-null   float64
 6   ER_IHC                          1903 non-null   float64
 7   HORMONE_THERAPY                 1903 non-null   float64
 8   CLAUDIN_SUBTYPE                 1903 non-null   float64
 9   LATERALITY                      1903 non-null   float64
 10  RADIO_THERAPY                   1903 non-null   float64
 11  HISTOLOGICAL_SUBTYPE            1903 non-null   float64
 12  BREAST_SURGERY                  1903 non-null   float64
 13  CANCER_TYPE_DETAILED            1903 non-null   float64
 14  ER_STATUS                       1903 non-null   float64
 15  HER2_STATUS                     1903 non-null   float64
 16  ONCOTREE_CODE                   1903 non-null   float64
 17  PR_STATUS                       1903 non-null   float64
 18  LYMPH_NODES_EXAMINED_POSITIVE   1903 non-null   float64
 19  NPI                             1903 non-null   float64
 20  AGE_AT_DIAGNOSIS                1903 non-null   float64
 21  COHORT                          1903 non-null   float64
 22  GRADE                           1903 non-null   float64
 23  TUMOR_SIZE                      1903 non-null   float64
 24  TUMOR_STAGE                     1903 non-null   float64


Fig. 31 Output of the integrated clinical and transcriptomic data information. The output gives an overview of 
the merged data frame, including total entries, data types, names of columns, and the number of validated 
data points. There are 1903 entries and 79 columns in the merged data frame. The first 25 features are shown 
in this figure. There are no missing values in the final dataset 




The number of duplicate values in data 0 
After the first preprocessing, the shape of data is (1903, 78) 
Missing value number: 0 


Fig. 32 Output of the integrated clinical and transcriptomic data preprocessing. After removing the PATIENT_ID
column, the new dataset consists of 1903 rows and 78 columns. There are no duplicates or missing values
in the final dataset


# Have a quick look at data
data.info()
data.head()


3.6.2 Preprocess and Explore Data For this experiment, checking for duplicate and
missing values is optional since the data was already processed in the previous
two experiments. However, it is always good practice to conduct
the preprocessing step after loading data (step 3 in Fig. 1). The
PATIENT_ID column was removed from the dataset before
moving on to the next step of the pipeline. As shown in Fig. 32,
the new shape of the data was 1903 rows and 78 columns. No
duplicates or missing values were found in the final dataset.


# Preprocess data & Explore data
# Check duplicate values
print('The number of duplicate values in data', data.duplicated().sum())

# Drop unused cols: Based on data.info(), we will drop some unused cols
# and null cols
drop_list = ['PATIENT_ID']
df = data.drop(drop_list, axis=1)

# Report the shape of the preprocessed frame (df), not of the original data
print('After the first preprocessing, the shape of data is', df.shape)

# Check missing values again
print('Missing value number:', df.isna().sum().sum())


Next, the correlation matrix was plotted to provide more
insights into the relationships between features in the merged dataset
(Fig. 33). Except for some pairs of clinical features, such as ER_STATUS
and ER_IHC, no highly correlated values were observed
between clinical and transcriptomic features.


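# Correlation analysis
# Imports are repeated here so that the snippet is self-contained
import matplotlib.pyplot as plt
import seaborn as sns

colormap = plt.cm.Reds
plt.figure(figsize=(15, 15))
sns.heatmap(df.corr(), linewidths=0.1, vmax=0.8,
            square=True, cmap=colormap, linecolor='white')
plt.title('Correlation matrix', fontsize=14)
plt.show()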
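
To complement the visual inspection of the heatmap, highly correlated feature pairs can also be listed explicitly. The following is a minimal sketch in plain Python rather than the tutorial's own code: the helper names pearson and high_correlation_pairs, the 0.8 threshold, and the toy columns are assumptions made for illustration only.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson linear correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def high_correlation_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Toy columns for illustration only (not METABRIC values)
toy_columns = {
    "ER_STATUS": [1, 1, 0, 1, 0, 0, 1, 1, 0, 1],
    "ER_IHC": [1, 1, 0, 1, 0, 0, 1, 1, 1, 1],
    "LCN15": [0.2, 0.9, 0.4, 0.1, 0.7, 0.3, 0.8, 0.5, 0.6, 0.35],
}

flagged = high_correlation_pairs(toy_columns)
print(flagged)
```

In the toy data only the two clinical columns pass the threshold, which mirrors the pattern reported for Fig. 33.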




[Figure: heatmap titled "Correlation matrix"; both axes list the clinical and transcriptomic features; a color bar gives the correlation values]

Fig. 33 Correlation matrix of clinical and transcriptomic features. The correlation matrix depicts the linear correlation between all the pairs of attributes and ranges
from -1 (perfect negative correlation) to +1 (perfect positive correlation), with a value of zero indicating no correlation between the features. Color density represents the
correlation values, where a darker color implies a higher value and a lighter color implies a lower value. Except for some pairs of clinical features having high correlation
values (as previously seen in Fig. 10), no highly correlated values were observed between clinical and transcriptomic features




[Figure: stacked histogram titled "Time Distribution for Censored and Observed Events"; legend: Observed Events (Death), Censored Events; x-axis: Time (months); y-axis: Frequency]


Fig. 34 Distribution of follow-up times of censored and observed (death) events in the integrated clinical and
transcriptomic data. Of the observations in the data, 42% are censored. The distribution is right-skewed,
and it differs between censored patients and those who experienced an event. The censored group has
more patients with longer survival times


In the next step, the follow-up survival-time distribution was
plotted (Fig. 34). Overall, the time distribution plot for this experiment was
similar to the one observed in Experiment 2.
Of the observations in the integrated
clinical and transcriptomic data, 42% were censored.


# Time Distribution of Death and Censored
# OS_STATUS is 1 for an observed death and 0 for a censored patient
num_censored = df.shape[0] - df["OS_STATUS"].sum()
print("%.1f%% of records are censored" % (num_censored / df.shape[0] * 100))

plt.figure(figsize=(10, 6))
val, bins, patches = plt.hist((df.query('OS_STATUS == 1')['OS_MONTHS'],
                               df.query('OS_STATUS == 0')['OS_MONTHS']),
                              bins=30, stacked=True)
_ = plt.legend(patches, ["Time of Death", "Time of Censored"])
plt.title("Time Distribution of Censored and Death Patients")
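
The censoring rate computed above also matters when the data is split: the captions of Figs. 27 and 36 state that each 80:20 split keeps the same censoring percentage in the training and testing sets. One way to obtain such a split is to sample censored and observed patients separately. The sketch below is a plain-Python illustration under that assumption; the function name stratified_split, the seed, and the toy event vector are hypothetical and do not come from the tutorial notebook.

```python
import random


def stratified_split(events, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with a similar censoring rate in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    # Sample censored (0) and observed (1) patients separately
    for status in (0, 1):
        idx = [i for i, e in enumerate(events) if e == status]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy event vector for illustration: 42 censored and 58 observed patients
toy_events = [0] * 42 + [1] * 58
train_idx, test_idx = stratified_split(toy_events)

test_censoring = sum(1 for i in test_idx if toy_events[i] == 0) / len(test_idx)
train_censoring = sum(1 for i in train_idx if toy_events[i] == 0) / len(train_idx)
print(len(train_idx), len(test_idx), train_censoring, test_censoring)
```

Repeating the split with different seeds gives the kind of replicated evaluation summarized in the boxplots.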


The rest of Experiment 3 includes plotting the CPH model,
preparing and evaluating the ML algorithms, and interpreting the
results. These steps were set up as previously done in Experiment 1 in
Subheadings 3.4.3, 3.4.4, and 3.4.5. Hence, we do not repeat the
code in the sections below, but we discuss and interpret the results.
The complete notebook can be accessed at
https://github.com/Angione-Lab/survival_analysis_tutorial.


3.6.3 Plot Cox Proportional Hazards Model Following the same approach presented in Subheading 3.4.3 in
Experiment 1 (step 5 in Fig. 1), the merged data was normalized,
and the CPH model was fitted to generate the log(HR)s and the
final statistical report. Figure 35 shows that AGE_AT_DIAGNOSIS
was identified as the most significant factor associated with a
higher probability of experiencing the event, with a log(HR) value
of 3.800. In contrast, the genes ECEL1 and KPRP were found to be
negatively associated with the death event, as shown by their
negative log(HR) values. Patients with higher expression values for
these two genes tend to live longer than those with
lower expression values of the same genes.


3.6.4 Set Up and Evaluate Machine Learning Algorithms Following the same steps (step 6 in Fig. 1) described in Subheading
3.4.4, data was prepared by splitting it into training and testing sets.
The experiment pipeline was the same as in Experiments 1 and
2. First, the ML models were trained using a grid search approach
with fivefold cross-validation to identify the optimal hyperparameters.
Next, the fitted models were evaluated on the testing set.
The random split into training and testing sets and the evaluation
were then repeated 20 times to obtain an average C-index. Finally,
Kaplan-Meier curves and log-rank tests were applied to statistically
compare the differences between the two predicted risk groups.
The high-risk group contained the patients in the testing set with
predicted risk scores above the median value. In contrast, the
low-risk group included the patients with predicted risk scores
below the median value. This process helped us to identify the
optimal algorithm that could successfully estimate the survival risk
of the cancer patient.

Figures 36 and 37 show the results of Experiment 3. RSF was
the best performing model with an average C-index value of 0.683,
followed by GBS, SSVM, and CPH, with C-indices equal to 0.675,
0.673, and 0.669, respectively. Kaplan-Meier curves, reported in
Fig. 38, show that all four models had statistically significant
differences in survival distributions between risk groups. RSF was again
the best survival algorithm, with the lowest p-value of 1.197E-14.


3.6.5 Interpret Model For the final step of the pipeline (step 7 in Fig. 1), SHAP values
were used to interpret the results of the models. The SHAP plots of
the top 20 most important features are reported in Fig. 39. A single
patient is represented by each data point for each feature. The y-axis
lists the top 20 most influential features in descending order, while




[Figure: forest plot titled "Cox Proportional Hazards Model for Clinical and Transcriptomic data"; y-axis: clinical and gene features, with AGE_AT_DIAGNOSIS and LYMPH_NODES_EXAMINED_POSITIVE at the top; x-axis: log(HR) (95% CI)]


Fig. 35 Results of the CPH model applied to the merged clinical and transcriptomic dataset. AGE_AT_DIAGNOSIS
was identified as the most significant factor associated with the death events, with a log(HR) value of
3.800, followed by LYMPH_NODES_EXAMINED_POSITIVE and the genes KRT1, OTOS, and ACTC1. In contrast,
predictors with negative log(HR) values, such as the genes ECEL1 and KPRP, were negatively associated with the death event.
Patients with higher expression values of these genes tend to live longer than those with lower expression
values


[Figure: boxplots of the C-index over 20 runs for each model. x-axis: Model Name (CPH, RSF, GBS, SSVM); y-axis: C-Index]


Fig. 36 C-index comparisons for Experiment 3. Boxplots of C-index results of the integrated clinical and
transcriptomic data using CPH, RSF, GBS, and SSVM. The experiments were replicated 20 times. In each
experiment, the data was randomly divided into training and testing sets with a ratio of 80:20 while
guaranteeing the same censoring percentage in each split. RSF was found to have the
highest median C-index, followed by GBS, SSVM, and CPH


the x-axis reports their corresponding SHAP values for a specific
observation in the testing set. The higher the SHAP value
associated with a patient, the higher the mortality risk of that
patient. AGE_AT_DIAGNOSIS was selected among the top
features by the four models, suggesting the high impact of this
feature on the survival outcome. Figure 39 also identifies some
critical biomarkers affecting the prediction outcomes of algorithms,
including the genes ERAS, SLC14A1, and LCN15. Specifically, high
values of ERAS and SLC14A1 had a negative impact on the
outcome of the models (i.e., high expression values of these genes
correlated negatively with the probability of experiencing the
event), while high values of LCN15 showed a positive impact on
the outcome of the models (i.e., high values of this gene correlated
positively with the probability of experiencing the event). All these
genes were associated with patient survival and could be useful
prognostic biomarkers for breast cancer patients.
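
The sign convention discussed above can be made concrete for a linear risk score. For a linear model with independent features, the SHAP value of a feature has a closed form: its weight multiplied by the difference between the patient's value and the mean value in the background data. The sketch below illustrates only this special case and is not the code used to produce Fig. 39. The function name linear_shap_values is hypothetical, the weight of AGE_AT_DIAGNOSIS reuses the log(HR) of 3.800 reported in Fig. 35, and the ECEL1 weight and all feature values are invented for illustration.

```python
def linear_shap_values(weights, x, background_means):
    """SHAP values of a linear risk score, assuming independent features."""
    return {name: weights[name] * (x[name] - background_means[name])
            for name in weights}


# Weights of a toy linear risk score on min-max scaled features
weights = {"AGE_AT_DIAGNOSIS": 3.8, "ECEL1": -1.2}  # ECEL1 weight is made up
background_means = {"AGE_AT_DIAGNOSIS": 0.5, "ECEL1": 0.5}
patient = {"AGE_AT_DIAGNOSIS": 0.9, "ECEL1": 0.8}

shap_values = linear_shap_values(weights, patient, background_means)
print(shap_values)
```

For this toy patient, the older age yields a positive SHAP value (higher predicted risk), while the high expression of a gene with a negative weight yields a negative SHAP value (lower predicted risk). The values sum to the difference between the patient's risk score and the score of an average patient.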




Cox Regression
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model', CoxPHSurvivalAnalysis(alpha=1))])
C-index for test set (Hold out): 0.6594013249366157
Average C-index for 20 runs 0.66945


Random Forest Survival
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model',
                 RandomSurvivalForest(max_depth=8, max_features='sqrt',
                                      min_samples_leaf=50,
                                      min_samples_split=100, n_estimators=500,
                                      random_state=5))])
C-index for test set (Hold out): 0.6730187290422834
Average C-index for 20 runs 0.6833


Gradient Boosting Survival
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model',
                 GradientBoostingSurvivalAnalysis(learning_rate=0.01,
                                                  n_estimators=800,
                                                  random_state=5))])
C-index for test set (Hold out): 0.6474809847059786
Average C-index for 20 runs 0.67505


SVM Survival
Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model',
                 FastSurvivalSVM(max_iter=500, optimizer='avltree',
                                 random_state=5))])
C-index for test set (Hold out): 0.6683978081295494
Average C-index for 20 runs 0.6729499999999999


Fig. 37 ML model results for the integrated clinical and transcriptomic data. The selected hyperparameters,
initial test result, and average C-index of each model are displayed in the output. Overall, the average
performance over 20 runs of the three ML models outperformed CPH when using the integrated data for
survival prediction. RSF had the highest average C-index with a value of 0.683, followed by GBS, SSVM,
and CPH
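
The C-index values reported in Fig. 37 measure how often a model ranks pairs of patients correctly. As a minimal sketch of the idea, the plain-Python function below computes Harrell's concordance index on toy data: a pair is comparable when the patient with the shorter follow-up time experienced the event, and it is concordant when that patient also received the higher predicted risk. The function name and the toy values are assumptions for illustration, and tied survival times are simply skipped here, which a library implementation such as concordance_index_censored in scikit-survival handles more carefully.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable pairs ranked correctly by risk."""
    concordant = tied = comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: patient i had the event strictly before time j
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    tied += 1
    return (concordant + 0.5 * tied) / comparable


# Toy data for illustration only (event = 1 means death, 0 means censored)
toy_times = [5, 8, 12, 20, 30]
toy_events = [1, 1, 0, 1, 0]
toy_risk = [0.9, 0.4, 0.7, 0.5, 0.1]

c_index = concordance_index(toy_times, toy_events, toy_risk)
print(c_index)  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the values of about 0.67 to 0.68 in Fig. 37 into context.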


4 Conclusions 


The application of ML models on the integration of clinical and 
omics data can provide data insights to improve personalized treat- 
ment and precision oncology. However, there are still some chal- 
lenges to overcome, mainly related to the high dimensionality of 
the data and the heterogeneity of samples. Hence, better 
approaches to develop accurate predictive models and identify crit- 
ical prognostic markers need to be implemented. In this tutorial, we 




[Figure: Kaplan-Meier curves for the high-risk (HR) and low-risk (LR) groups, with at-risk, censored, and event counts below each panel. Panels: KM Curves CPH (p-value = 9.873e-12), KM Curves RSF (p-value = 1.197e-14), KM Curves GBS (p-value = 1.194e-09), KM Curves SSVM (p-value = 4.036e-09); x-axis: Time (months)]


Fig. 38 Kaplan-Meier curves to compare high-risk and low-risk groups, stratified by predicted survival risk
scores. The high-risk group (n = 190) included patients with predicted risk scores above the median value,
while the low-risk group (n = 190) comprised patients with risk scores below the median value. The p-value
from the log-rank test was calculated to determine the statistical
difference between the survival distributions of the two groups. The figure shows statistically significant
differences between survival groups for all four models, with p-values lower than 0.0001


showed that ML models are effective methods to
analyze medical data and predict patient-specific survival outcomes.
Our study proposed a step-by-step protocol to design and evaluate
the traditional statistical model CPH and three ML models for
breast cancer survival, i.e., RSF, GBS, and SSVM. The performance
of the ML models was assessed using the METABRIC dataset. The
presented pipeline, based on optimizing the C-index by using a grid
search approach and a fivefold cross-validation method, has great
potential to improve the performance of models and generalize the




[Figure: SHAP summary plots in four panels, (a) CPH, (b) RSF, (c) GBS, (d) SSVM; y-axis: top features for each model; x-axis: SHAP value (impact on model output)]


Fig. 39 SHAP summary plot for the integrated clinical and transcriptomic data for (a) CPH, (b) RSF, (c) GBS,
and (d) SSVM models. For each feature, a single patient is represented by each data point. The y-axis
lists the top prognostic biomarkers and presents them in descending order based on the ranking provided by
the mean of their absolute SHAP values. The x-axis reports the SHAP value indicating the impact of the feature
on the algorithm's prediction outcome for a specific observation in the testing set. The color represents the
value of the feature for each instance. The higher the SHAP value associated with the patient, the higher the
risk of death. Among the survival risk predictors, AGE_AT_DIAGNOSIS was consistently selected by the four
models as one of the top factors impacting the outcome of the models. Specifically, high values of this feature
correlated with a higher probability of experiencing the event. Other biomarkers such as ERAS, SLC14A1, and
LCN15 were identified as features having a high impact on the prediction outcomes and associated with the
predicted survival likelihood


models for survival prediction on unseen data. Furthermore, we 
used SHAP values to interpret the model results and identify the 
features that had the highest impact on the prediction outcomes of 
models. The improvement in ML interpretability will help research- 
ers and clinicians understand more about ML models and thus gain 
more credibility and trust. This tutorial represents one step further 
to bring these novel solutions to clinicians and to the public. Our 


work offers an exploratory strategy to enhance the biological
understanding of prognostic ML models.

We conducted three different experiments: on clinical data,
on transcriptomic data, and on the integration of these two data
types. Incorporating clinical and mRNA expression data is crucial
to uncover a sequence of complicated interactions in multiple
biological processes and complex human conditions. Due to the
high-dimensional nature of transcriptomic data, mRMR was
applied as a feature selection technique. This preprocessing step
also helps to boost the performance of models, save computational
resources, and reduce overfitting.

Although we presented the most used ML techniques to perform
survival analysis on different types of data, this tutorial has some
limitations. We only considered three ML algorithms,
namely RSF, GBS, and SSVM, because of their popularity and
effectiveness in analyzing survival data. However, other approaches
based on deep learning, a branch of ML, have also proved their
capability to work with survival data. Some packages are available to
run deep-learning-based models for cancer prognosis, such as
DeepSurv [80], Cox-nnet [43], and DeepProg [81]. A performance
comparison between our approaches and other
deep-learning-based models could enable researchers to explore and
obtain optimal ways to supplement conventional survival analysis
techniques.

The number of features selected in our study could also have
limited the findings when using transcriptomic data. To save
computation time and resources, we only extracted 50 features to
demonstrate our approach. Future studies could adopt our
framework and repeat our steps exploring different numbers of features.

In summary, by performing survival analysis across different
models and data, our results revealed that ML approaches were
capable of generating accurate prognostic predictions. The
ML-based models showed better performance than the traditional
statistical method, i.e., the CPH model. In particular, RSF
reported the best performance in analyzing the transcriptomic
data (Experiment 2) and the integrated clinical and transcriptomic
data (Experiment 3), while SSVM was the best performing
model when using clinical data only (Experiment 1).


Acknowledgements

AO and CA acknowledge the support of Earlier.org through their
Research Grant "Application of computational models of breast
cancer for early-detection personalised tests." CA acknowledges
the support of EPSRC and The Alan Turing Institute through
their Turing Network Development Award, and the Children's
Liver Disease Foundation through their Research Grant.




References 


1. Ferlay J, Héry C, Autier P, Sankaranarayanan R (2010) Global burden of breast cancer. In: Breast cancer epidemiology. Springer, pp 1-19

2. Cancer Research UK (2021) Breast cancer statistics. URL https://www.cancerresearchuk.org/health-professional/cancer-statistics/statistics-by-cancer-type/breast-cancer

3. Office for National Statistics (2019) Cancer survival in England: national estimates for patients followed up to 2017. URL https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/conditionsanddiseases/bulletins/cancersurvivalinengland/nationalestimatesforpatientsfollowedupto2017

4. Robson M, Im SA, Senkus E et al (2017) Olaparib for metastatic breast cancer in patients with a germline BRCA mutation. New Engl J Med 377(6):523-533

5. De Bin R, Sauerbrei W, Boulesteix AL (2014) Investigating the prediction ability of survival models based on both clinical and omics data: two case studies. Stat Med 33(30):5310-5329

6. Hira MT, Razzaque M, Angione C et al (2021) Integrated multi-omics analysis of ovarian cancer using variational autoencoders. Sci Rep 11(1):1-16

7. Conesa A, Beck S (2019) Making multi-omics data accessible to researchers. Sci Data 6(1):14

8. Vijayakumar S, Conway M, Lió P, Angione C (2018) Optimization of multi-omic genome-scale models: Methodologies, hands-on tutorial, and perspectives. Metabolic Netw Reconstr Model 1716:389-408

9. Angione C (2019) Human systems biology and metabolic modelling: a review-from disease metabolism to precision medicine. BioMed Res Int 2019

10. Zhao Z, Zhang KN, Wang Q et al (2021) Chinese Glioma Genome Atlas (CGGA): a comprehensive resource with functional genomic data from Chinese glioma patients. Genomics Proteomics Bioinformatics 19(1):1

11. Iuliano A, Occhipinti A, Angelini C et al (2018) Combining pathway identification and breast cancer survival prediction via screening-network methods. Front Genet 9:206

12. Gyorffy B (2021) Survival analysis across the entire transcriptome identifies biomarkers with the highest prognostic power in breast cancer. Comput Struct Biotechnol J 19:4101-4109

13. Higdon R, Earl RK, Stanberry L et al (2015) The promise of multi-omics and clinical data integration to identify and target personalized

14. 


15. 


16. 


17. 


18. 


19. 


20. 


21, 


22, 


23. 


24 


25, 


26. 


391 


healthcare approaches in autism spectrum dis- 
orders. Omics J Integr Biol 19(4):197-208 
Hasin Y, Seldin M, Lusis A (2017) Multi-omics 
approaches to disease. Genome Biol 18(1): 
1-15 

Yaneske E, Angione C (2018) The poly-omics 
of ageing through individual-based metabolic 
modelling. BMC Bioinf 19(14):83-96 

Yan J, Risacher SL, Shen L, Saykin AJ (2018) 
Network approaches to systems biology analy- 
sis of complex disease: integrative methods for 
multi-omics data. Brief Bioinf 19(6): 
1370-1381 

Occhipinti A, Hamadi Y, Kugler H et al (2020) 
Discovering essential multiple gene effects 
through large scale optimization: an applica- 
tion to human cancer metabolism. [EEE/ 
ACM Trans Comput Biol Bioinf 18:2339 
Eyassu F, Angione C (2017) Modelling pyru- 
vate dehydrogenase under hypoxia and its role 
in cancer metabolism. R Soc Open Sci 4(10): 
170360 

Zhao L, Dong Q, Luo C et al (2021) DeepO- 
mix: A scalable and interpretable multi-omics 
deep learning framework and application in 
cancer survival analysis. Comput Struct Bio- 
technol J 19:2719-2725 

Yaneske E, Zampieri G, Bertoldi L et al (2021) 
Genome-scale metabolic modelling of SARS- 
CoV-2 in cancer cells reveals an increased shift 
to glycolytic energy production. FEBS Lett 
595(18):2350-2365 

Angione C (2018) Integrating splice-isoform 
expression into genome-scale models charac- 
terizes breast cancer metabolism. Bioinformat- 
ics 34(3):494-501 

Anaya J, Reon B, Chen WM et al (2016) A 
pan-cancer analysis of prognostic genes. PeerJ 
3:e1499 

Zhu B, Song N, Shen R et al (2017) Integrat- 
ing clinical and multiple omics data for prog- 
nostic assessment across human cancers. Sci 
Rep 7(1):1-13 


. Islam MM, Haque MR, Iqbal H et al (2020) 


Breast cancer prediction: a comparative study 
using machine learning techniques. SN Com- 
put Sci 1(5):1-14 

Zampieri G, Vijayakumar S, Yaneske E, 
Angione C (2019) Machine and deep learning 
meet genome-scale metabolic modeling. PLoS 
Comput Biol 15(7):e1007084 

Alabi RO, Elmusrati M, Sawazaki-Calone I 
et al (2020) Comparison of supervised machine 
learning classification techniques in prediction 


392 


27. 


28. 


29. 


30. 


31. 


32. 


33. 


34. 


35. 


36. 


37. 


38. 


39. 


Le Minh Thao Doan et al. 


of locoregional recurrences in early oral tongue 
cancer. Int J Med Informatics 136:104068 
Culley C, Vijayakumar S, Zampieri G, Angione 
C (2020) A mechanism-aware and multiomic 
machine-learning pipeline characterizes yeast 
cell growth. Proc Natl Acad Sci 117(31): 
18869-18879 

Chugh G, Kumar S, Singh N (2021) Survey on 
machine learning and deep learning applica- 
tions in breast cancer diagnosis. Cogn 
Comput: 1-20 

Akram M, Iqbal M, Daniyal M, Khan AU 
(2017) Awareness and current knowledge of 
breast cancer. Biol Res 50(1):1-23 

Simmons CP, McMillan DC, McWilliams K 
et al (2017) Prognostic tools in patients with 
advanced cancer: a systematic review. J Pain 
Symptom Manag 53(5):962-970 

Ascolani G, Occhipinti A, Lio P (2015) Mod- 
elling circulating tumour cells for personalised 
survival prediction in metastatic breast cancer. 
PLoS Comput Biol 11(5):e1004199 

Wang P, Li Y, Reddy CK (2019) Machine 
learning for survival analysis: A survey. ACM 
Comput Surv (CSUR) 51(6):1-36 

Mariotto AB, Noone AM, Howlader N et al 
(2014) Cancer survival: an overview of mea- 
sures, uses, and interpretation. J Natl Cancer 
Inst Monographs 2014(49):145-186 

Austin PC (2017) A tutorial on multilevel sur- 
vival analysis: methods, models and applica- 
tions. Int Stat Rev 85(2):185-203 

Iuliano A, Occhipinti A, Angelini C et al 
(2016) Cancer markers selection using 
network-based cox regression: a methodologi- 
cal and computational practice. Front Physiol 
7:208 

Yang Y, Lu Q, Shao X et al (2018) Develop- 
ment of a three-gene prognostic signature for 
hepatitis b virus associated hepatocellular carci- 
noma based on integrated transcriptomic anal- 
ysis. J Cancer 9(11):1989 

Kiebish MA, Cullen J, Mishra P et al (2020) 
Multi-omic serum biomarkers for prognosis of 
disease progression in prostate cancer. J Transl 
Med 18(1):1-10 

Hao J, Kim Y, Mallavarapu T et al (2019) 
Interpretable deep neural network for cancer 
survival analysis by integrating genomic and 
clinical data. BMC Med Genomics 12(10): 
1-13 

Moncada-Torres A, van Maaren MC, Hendriks 
MP? et al (2021) Explainable machine learning 
can outperform Cox regression predictions and 
provide insights in breast cancer survival. Sci 
Rep 11(1):1-13 


40. 


41. 


42. 


43. 


44 


45. 


46 


47 


48. 


49. 


50. 


51. 


52. 


53. 


Akai H, Yasaka K, Kunimatsu A et al (2018) 
Predicting prognosis of resected hepatocellular 
carcinoma by radiomics analysis with random 
survival forest. Diagn Interv imaging 99(10): 
643-651 

Bibault JE, Chang DT, Xing L (2021) Devel- 
opment and validation of a model to predict 
survival in colorectal cancer using a gradient- 
boosted machine. Gut 70(5):884-889 

Wang H, Zheng B, Yoon SW, Ko HS (2018) A 
support vector machine-based ensemble algo- 
rithm for breast cancer diagnosis. Eur J Oper 
Res 267(2):687-699 

Ching T, Zhu X, Garmire LX (2018) 
Cox-nnet: an artificial neural network method 
for prognosis prediction of high-throughput 


omics data. PLoS Comput Biol 14(4): 
e1006076 
. Huang Z, Zhan X, Xiang S et al (2019) 


SALMON: survival analysis learning with 
multi-omics neural networks on breast cancer. 
Front Genet 10:166 

Cheon S, Agarwal A, Popovic M et al (2016) 
The accuracy of clinicians’ predictions of sur- 


vival in advanced cancer: a review. Ann Palliat 
Med 5(1):22-29 


. Pereira B, Chin SF, Rueda OM et al (2016) 


The somatic mutation profiles of 2,433 breast 
cancers refine their genomic and transcriptomic 
landscapes. Nat Commun 7(1):1-16. https:// 
doi.org/10.1038 /ncomms11479 


. Lundberg SM, Lee SI (2017) A_ unified 


approach to interpreting model 
predictions. In: Proceedings of the 31st inter- 
national conference on neural information pro- 
cessing systems, pp 4768-4777 

Singh R, Mukhopadhyay K (2011) Survival 
analysis in clinical trials: Basics and must know 
areas. Perspect Clin Res 2(4):145 

Cox DR (1972) Regression models and life- 
tables. J R Stat Soc B (Methodol) 34(2): 
187-202 

Ishwaran H, Kogalur UB, Blackstone EH, 
Lauer MS (2008) Random survival forests. 
Annals Appl Stat 2(3):841-860 

Breiman L (2001) Random forests. Mach 
Learn 45(1):5-32 

Azar AT, Elshazly HI, Hassanien AE, Elkorany 
AM (2014) A random forest classifier for 
lymph diseases. Comput Methods Programs 
Biomed 113(2):465-473 

Qu Z, Li H, Wang Y et al (2020) Detection of 
electricity theft behavior based on improved 
synthetic minority oversampling technique 


and random forest classifier. Energies 13(8): 
2039 


54. 


55. 


56. 


57. 


58 


59. 


60. 


6l. 


62. 


63. 


64. 


65. 


66. 


67. 


68. 


Machine Learning Methods for Survival Analysis of Breast Cancer 


Harrell FE, Califf RM, Pryor DB et al (1982) 
Evaluating the yield of medical tests. JAMA 
247(18):2543-2546 

Hothorn T, Biihlmann P, Dudoit S et al (2006) 
Survival ensembles. Biostatistics 7(3):355-373 
Natekin A, Knoll A (2013) Gradient boosting 
machines, a tutorial. Front Neurorobotics 7:21 
Friedman JH (2001) Greedy function approxi- 
mation: a gradient boosting machine. Ann Stat 
29:1189-1232 


. Ridgeway G (1999) The state of boosting. 


Comput Sci Stat:172-181 

Khan FM, Zubek VB (2008) Support vector 
regression for censored data (SVRC): a novel 
tool for survival analysis. In: 2008 Eighth IEEE 
international conference on data mining. 
IEEE, pp 863-868 

Vapnik V (1999) The nature of statistical 
learning theory. Springer Science & Business 
Media 

Polsterl S, Navab N, Katouzian A (2015) Fast 
training of support vector machines for survival 
analysis. In: Joint European conference on 
machine learning and knowledge discovery in 
databases. Springer, pp 243-259 

Leger S, Zwanenburg A, Pilz K et al (2017) A 
comparative study of machine learning meth- 
ods for time-to-event survival data for radio- 
mics risk modelling. Sci Rep 7(1):1-11 
Garate-Escamila AK, El Hassani AH, Andrés E 
(2020) Classification models for heart disease 
prediction using feature selection and PCA. 
Informatics Med Unlocked 19:100330 

Ewees AA, Al-qaness MA, Abualigah L et al 
(2021) Boosting arithmetic optimization algo- 
rithm with genetic algorithm operators for fea- 
ture selection: Case study on Cox proportional 
hazards model. Mathematics 9(18):2321 
Schemper M, Kaider A, Wakounig S, Heinze G 
(2013) Estimating the correlation of bivariate 
failure times under censoring. Stat Med 
32(27):4781-4790 

Su Z, Tang B, Liu Z, Qin Y (2015) Multi-fault 
diagnosis for rotating machinery based on 
orthogonal supervised linear local tangent 
space alignment and least square support vec- 
tor machine. Neurocomputing 157:208-222 
Rodrigues D, Pereira LA, Nakamura RY et al 
(2014) A wrapper approach for feature selec- 
tion based on Bat algorithm and optimum- 
path forest. Expert Syst Appl 41(5): 
2250-2258 

Peng H, Long F, Ding C (2005) Feature selec- 
tion based on mutual information criteria of 
max-dependency, max-relevance, and 


69. 


70. 


71. 


72; 


73. 


74. 


75. 


76. 


TL: 


78. 


79. 


80. 


81. 


393 


min-redundancy. IEEE Trans Pattern Anal 
Mach Intell 27(8):1226-1238 


Curtis C, Shah SP, Chin SF et al (2012) The 
genomic and transcriptomic architecture of 
2,000 breast tumours reveals novel subgroups. 
Nature 486(7403):346-352 

Polsterl S (2020) scikit-survival: A library for 
time-to-event analysis built on top of scikit- 
learn. J Mach Learn Res 21(212):1-6 


Van Rossum G, Drake FL (2009) Python 3 ref- 
erence manual. CreateSpace, Scotts Valley, CA 


Kim B, Khanna R, Koyejo OO (2016) Exam- 
ples are not enough, learn to criticize! Criticism 
for Interpretability. In: Advances in neural 
information processing systems, vol 29 


Lundberg SM, Nair B, Vavilala MS, Horibe M, 
Eisses MJ, Adams T, Liston DE, Low DKW, 
Newman SF, Kim J, et al (2018) Explainable 
machine-learning predictions for the preven- 
tion of hypoxaemia during surgery. Nat 
Biomed Eng 2(10):749-760 

Aittokallio T (2010) Dealing with missing 
values in large-scale studies: microarray data 
imputation and beyond. Brief Bioinformatics 
11(2):253-264 

Fryett JJ, Inshaw J, Morris AP, Cordell HJ 
(2018) Comparison of methods for transcrip- 
tome imputation through application to two 
common complex diseases. Eur J Hum Genet 
26(11):1658-1667 

Shahjaman M, Rahman MR, Islam T et al 
(2021) rMisbeta: A robust missing value impu- 
tation approach in transcriptomics and meta- 
bolomics data. Comput Biol Med 138:104911 
Park S, Shin B, Shim WS et al. (2019) Wx: a 
neural network-based feature selection algo- 
rithm for transcriptomic data. Sci Rep 9(1):1-9 
Han Y, Huang L, Zhou F (2021) Zoo: Select- 
ing transcriptomic and methylomic biomarkers 
by ensembling animal-inspired swarm intelli- 
gence feature selection algorithms. Genes 
12(11):1814 

Iuliano A, Occhipinti A, Angelini C et al 
(2021) COSMONET: An R package for sur- 
vival analysis using screening-network meth- 
ods. Mathematics 9(24):3262 

Katzman JL, Shaham U, Cloninger A et al 
(2018) DeepSurv: personalized treatment rec- 
ommender system using a Cox proportional 
hazards deep neural network. BMC Med Res 
Methodol 18(1):1-12 

Poirion OB, Jing Z, Chaudhary K et al (2021) 
DeepProg: an ensemble of deep-learning and 
machine-learning models for prognosis predic- 
tion using multi-omics data. Genome Med 
13(1):1-15 


Chapter 17


Machine Learning Using Neural Networks for Metabolomic 
Pathway Analyses 


Rosalin Bonetta Valentino, Jean-Paul Ebejer, and Gianluca Valentino 


Abstract 


Elucidating the mechanisms of metabolic pathways helps us understand the cascade of enzyme-catalyzed 
reactions that lead to the conversion of substances into final products. This has implications for predicting 
how newly synthesized compounds will affect a person’s metabolism and, hence, the development of novel 
treatments to improve one's health. The study of metabolomic pathways, together with protein
engineering, may also aid in the extraction, at scale, of natural products to be used as drugs and drug precursors.
Several approaches have been used to correlate protein annotations to metabolic pathways in order to derive 
pathways directly related to specific organisms. These could range from association rule-mining techniques 
to machine learning methods such as decision trees, naive Bayes, logistic regression, and ensemble methods. 
In this chapter, we will be reviewing the use of machine learning for metabolic pathway analyses, with a 
step-by-step focus on the use of deep learning to predict the association of compounds (metabolites) to 
their respective metabolomic pathway classes. This prediction could help explain interactions of small 
molecules in organisms. Inspired by the work of Baranwal et al. (2019), we demonstrate how to build 
and train a deep learning neural network model to perform a multi-label prediction. We considered two 
different types of fingerprints as features (inputs to the model). The output of the model is the set of 
metabolic pathway classes (from the KEGG dataset) in which the input molecule participates. We will walk 
through the various steps of this process, including data collection, feature engineering, model selection, 
training, and evaluation. This model-building and evaluation process may be easily transferred to other 
domains of interest. All the source code used in this chapter is made publicly available at
https://github.com/jp-um/machine_learning_for_metabolomic_pathway_analyses.


Key words Metabolomics, Machine learning, Neural networks, KEGG classes, Feature engineering, 
Performance metrics 


1. Introduction 


1.1 Metabolomics

Metabolomics involves the extensive analysis of metabolites, i.e., small
molecules (<1 kDa), in an organism or a particular biological sample.
Such analysis draws on the wealth of biochemical knowledge attained
over the last decades [1].

This area of research focuses on the intermediates and products 
of metabolism. These include fatty acids, carbohydrates, 


Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology, 
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_17, 
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023 




nucleotides, amino acids, antioxidants, and vitamins among other 
compounds. The metabolome includes all the metabolites that are 
synthesized by a biological system and can be characterized at all 
biological levels including organelles, cells, tissues, and 
organisms [2]. 

Numerous factors have been identified to affect metabolite 
levels within tissues and biological fluids, making the metabolome 
as a whole susceptible to fluctuations determined by genetic and 
environmental factors, gut microflora, as well as enzyme activity. 
Therefore, metabolomics gives us an indication of cell and ulti- 
mately organism health [3, 4]. 

As opposed to genomics, transcriptomics, or proteomics, meta- 
bolomics aims to give us an explanation of the response of organ- 
isms to physiological and pathophysiological stimuli. Hence, 
interest in this field of research has risen exponentially in recent 
years. Metabolomics, therefore, allows us to develop an understanding
of the effect that genetic variation, disease, treatment, or diet
exerts on the metabolic state of organisms [5, 6]. The analytical
methods used in metabolomics mainly include nuclear magnetic
resonance (NMR) spectroscopy and mass spectrometry
(MS) [7, 8]. Such techniques allow for the analysis
of numerous small molecules in a sample, and this may involve the 
identification and quantitation of metabolites. 

The three main research approaches taken in metabolomics
are metabolic fingerprinting, metabolite profiling, and
targeted metabolomics. Metabolic fingerprinting is the rapid eval- 
uation of the reproducible metabolite fingerprint of a biological 
sample. The metabolic fingerprint can be considered as the concen- 
tration of metabolites in a sample at a point in time. In this case, 
metabolite identification is not necessary. The aim of the fingerprint 
is to represent numerous compound classes which may be poten- 
tially interesting for applications such as drug discovery. The meta- 
bolites are not known in advance in this case. Metabolic 
fingerprinting does not require any advanced sample preparation 
or chromatographic resolution techniques. Instead it makes use of 
techniques which provide reproducible data. Metabolic fingerprint- 
ing is mostly used for classifying a sample rather than for quantita- 
tive analysis. This may be, for example, to distinguish between 
specimens in a healthy or disease state [9, 10]. 

Metabolite profiling consists of an approach in metabolomics 
which is non-targeted and includes analyzing a vast range of meta- 
bolites without knowing which compounds would be of interest in 
advance. As opposed to fingerprinting, the scope of profiling is to 
identify as well as quantify as many compounds of interest as 
possible via high-throughput metabolite quantification. The latter 
requires chromatographic separation at a high resolution coupled 
with mass spectrometry to enable the detection of new metabolic 
biomarkers [11, 12]. 


Targeted metabolomics consists of analyzing one or more
predefined metabolites and usually allows for their identification
and quantification within a sample. These compounds are
selected in advance, depending on their particular metabolic pathways
or biomarkers that would be related to a specific reaction in
the organism. In this case, analytical techniques that provide high
sensitivity and selectivity are used to attain low detection limits of
metabolites [11, 12].

Metabolomic research has a variety of health applications which
range from toxicology, newborn screening, and pharmacology to
clinical chemistry. Currently, metabolomics is employed to find new
biomarkers of numerous diseases and to highlight the biochemical
pathways contributing to their pathogenesis [1, 6, 7]. Besides
identifying new diagnostic biomarkers which can be utilized to
detect a disease at an early stage, metabolomic research can be
applied to find biomarkers that can be used to select an appropriate
therapy and subsequently evaluate the outcome of the particular
treatment applied. Thus, metabolomics can be used as a tool in the
development of personalized medicine.

This chapter is structured as follows: we first introduce machine
learning and a particular branch known as deep learning, as well as
one of the most commonly used architectures, the neural network.
Applications of deep learning to metabolomics in the literature are
reviewed. The neural network training procedure, which is performed
through backpropagation, is also covered. Through an illustrative
example, we will then demonstrate how to build and train a
deep learning neural network to predict the association of metabolites
to their respective metabolomic pathway classes. Like many
problems in biology, this is a multi-label problem and is well-suited
to the use of neural networks.

1.2 Machine Learning

Machine learning is broadly defined as the capability of an algorithm
to learn a model which can perform some task successfully
given some performance metric. Supervised learning is a particular
learning paradigm in which the algorithm is provided with a labeled
dataset of correct input-output pairs and learns to predict the
correct output given an input. Other learning paradigms include
unsupervised learning, in which the ground truth output is not
provided, and the algorithm therefore needs to discover some
underlying pattern in the available data, as well as reinforcement
learning, in which an agent explores an environment according to
some policy in order to earn rewards.

An artificial neural network (ANN) [13] is a type of machine
learning model which can be trained using supervised learning.
ANNs loosely mimic the behavior of a biological neural network,
in which neurons are interconnected and are triggered depending
on the input signal provided and an activation function. Weights are
associated with each connection between two neurons. The
network architecture follows a sequential structure, in which an
input signal is propagated throughout the network in a feed-forward
manner until it reaches the output. Deep learning [14] is
a term used to describe the training of large neural networks which
have many neurons and several hidden layers. The purpose of these
hidden layers is to allow the network to learn complex nonlinear
mappings, as well as to extract features from the input data.

1.3 ML Applications in Metabolomics

Deep learning has been increasingly used for problems in metabolomics
which are difficult to solve with conventional algorithms.
For example, in nuclear magnetic resonance (NMR) and mass
spectrometry (MS)-based metabolomics, a variety of ML algorithms
have been developed for data preprocessing, peak identification,
peak integration, compound identification/quantification,
data analysis, and data integration [15-19]. In particular, Baranwal
et al. [20] use graph convolutional neural networks to extract
molecular shape features which are then fed to a random forest
classifier to predict the pathway class for a given molecule.
Nevertheless, the number of deep learning-associated publications in
metabolomics is still significantly lower than in the other omics fields [21].

The uptake of deep learning and neural networks is also increasing
within the metabolomic community due to the availability of
programming languages such as Python [22], R [23], and
MATLAB [24] as well as frameworks such as TensorFlow [25],
Keras [26], PyTorch [27], and scikit-learn [28]. Deep learning
frameworks such as TensorFlow and PyTorch are designed to run on
graphics processing units (GPUs), which can parallelize
computationally intensive tasks (e.g., matrix multiplication) and are
readily available in desktop computers and computing clusters.

2 Methods

2.1 Neural Network

A neural network is represented using a connected graph of neurons.
An example is shown in Fig. 1, where a neuron $Y$ receives
inputs from neurons $X_1$, $X_2$, and $X_3$. The outputs from these three
neurons are $x_1$, $x_2$, and $x_3$. The net input $y_{in}$ to the neuron $Y$ is the
sum of the weighted signals from these neurons:
$y_{in} = w_1 x_1 + w_2 x_2 + w_3 x_3$. Further neurons may then be connected
to $Y$. The output from $Y$ is $y_{out} = f(y_{in})$, where $f$ is called the
activation function.

A set of commonly used activation functions is shown in Fig. 2.
In order to achieve good performance, an activation function
should ideally have the following properties: (a) nonlinearity,
(b) continuous differentiability, (c) monotonicity, and
(d) approximation of the identity around the origin.
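The weighted sum and activation step described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the input values, the weights, and the function name `neuron_output` are assumptions chosen for the example and do not come from the chapter's source code.

```python
import math


def identity(x):
    """Identity activation: y = x."""
    return x


def relu(x):
    """ReLU activation: 0 for negative inputs, x otherwise."""
    return max(0.0, x)


def sigmoid(x):
    """Sigmoid activation: y = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, activation):
    """Compute y_out = f(y_in), where y_in is the weighted sum of the inputs."""
    y_in = sum(w * x for w, x in zip(weights, inputs))
    return activation(y_in)


# Illustrative values only (assumed for this sketch).
x = [1.0, 2.0, 3.0]      # outputs x1, x2, x3 of neurons X1, X2, X3
w = [0.5, -0.25, 0.1]    # weights w1, w2, w3 on the connections into Y

for name, f in [("identity", identity), ("tanh", math.tanh),
                ("ReLU", relu), ("sigmoid", sigmoid)]:
    print(name, neuron_output(x, w, f))
```

The same net input is passed through each of the four activation functions of Fig. 2, which makes it easy to compare how each one squashes or clips the signal.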


Fig. 1 An example of a simple neural network

Fig. 2 Commonly used neural network activation functions: identity ($y = x$), tanh ($y = \tanh x$), ReLU ($f(x) = 0$ if $x < 0$, $x$ otherwise), and sigmoid ($y = 1/(1 + e^{-x})$)


2.2 Training a Neural 
Network 


The training process typically involves three steps: (a) feed-forward,
(b) backpropagation, and (c) weight adjustment. The architecture
in Fig. 3 shows a neural network with one hidden layer, with
weights $v$ (input to hidden layer) and $w$ (hidden to output layer).
Note the inclusion of a bias neuron in each layer
(except the output layer), which has a constant input value of
1. The purpose of the bias neuron is analogous to that of an
intercept when fitting a line to some data: it allows the activation
function to be shifted as needed.

The following is the algorithm which is used to train a neural 
network by backpropagating the error through the network to 
update its weights: 


Step 0: The weights are initialized to small random values (e.g., in
the range -1 to 1).

Step 1: While the stopping condition is false, do Steps 2-9. 

Step 2: For each training pair, do Steps 3-8: 


Fig. 3 Backpropagation neural network with one hidden layer

Feedforward:


Step 3: Each input unit ($X_i$, $i = 1, \ldots, n$) receives input signal $x_i$ and
broadcasts this signal to all units in the layer above (the hidden
units).

Step 4: Each hidden unit ($Z_j$, $j = 1, \ldots, p$) sums its weighted input
signals:

$$z_{in,j} = v_{0j} + \sum_{i=1}^{n} x_i v_{ij}, \tag{1}$$

applies its activation function to compute its output signal:

$$z_j = f(z_{in,j}), \tag{2}$$

and sends this signal to all units in the layer above (output units).

Step 5: Each output unit ($Y_k$, $k = 1, \ldots, m$) sums its weighted input
signals:

$$y_{in,k} = w_{0k} + \sum_{j=1}^{p} z_j w_{jk}, \tag{3}$$

and applies its activation function to compute its output signal:

$$y_k = f(y_{in,k}). \tag{4}$$


Backpropagation of error:

Step 6: Each output unit ($Y_k$, $k = 1, \ldots, m$) receives a target pattern $t_k$
corresponding to the input training pattern, computes its error
information term,

$$\delta_k = (t_k - y_k) f'(y_{in,k}), \tag{5}$$

calculates its weight correction term (used to update $w_{jk}$ later),

$$\Delta w_{jk} = \alpha \delta_k z_j, \tag{6}$$

where $\alpha$ is the learning rate (which determines the rate at which the
weights are changed), calculates its bias correction term (used
to update $w_{0k}$ later),

$$\Delta w_{0k} = \alpha \delta_k, \tag{7}$$

and sends $\delta_k$ to units in the layer below.

Step 7: Each hidden unit ($Z_j$, $j = 1, \ldots, p$) sums its delta inputs
(from units in the layer above),

$$\delta_{in,j} = \sum_{k=1}^{m} \delta_k w_{jk}, \tag{8}$$

multiplies by the derivative of its activation function to calculate its
error information term,

$$\delta_j = \delta_{in,j} f'(z_{in,j}), \tag{9}$$

calculates its weight correction term (used to update $v_{ij}$ later),

$$\Delta v_{ij} = \alpha \delta_j x_i, \tag{10}$$

and calculates its bias correction term (used to update $v_{0j}$ later),

$$\Delta v_{0j} = \alpha \delta_j. \tag{11}$$


Update weights and biases:

Step 8: Each output unit ($Y_k$, $k = 1, \ldots, m$) updates its bias and
weights ($j = 0, \ldots, p$):

$$w_{jk} = w_{jk} + \Delta w_{jk}. \tag{12}$$

Each hidden unit ($Z_j$, $j = 1, \ldots, p$) updates its bias and weights
($i = 0, \ldots, n$):

$$v_{ij} = v_{ij} + \Delta v_{ij}. \tag{13}$$


Step 9. Test stopping condition. The stopping condition can be 
defined such that the training algorithm terminates once the 
change in the weights goes below a certain threshold, e.g., 
10-°. This ensures that the weights would have converged to 
some stable values. 
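The steps above can be sketched as a small, dependency-free Python program. This is an illustrative sketch rather than the implementation used later in the chapter: it assumes a sigmoid activation for both layers, stores the biases in row 0 of each weight matrix, and uses an XOR toy dataset, a learning rate of 0.5, a tolerance of 1e-4, and an epoch cap, all of which are assumptions made for the example.

```python
import math
import random


def f(x):
    """Sigmoid activation (an assumed choice for this sketch)."""
    return 1.0 / (1.0 + math.exp(-x))


def f_prime(x):
    """Derivative of the sigmoid: f'(x) = f(x) * (1 - f(x))."""
    s = f(x)
    return s * (1.0 - s)


def init_weights(n, p, m, rng):
    """Step 0: small random weights; row 0 of each matrix holds the biases."""
    v = [[rng.uniform(-1, 1) for _ in range(p)] for _ in range(n + 1)]  # v[i][j]
    w = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(p + 1)]  # w[j][k]
    return v, w


def feedforward(x, v, w):
    """Steps 3-5, Eqs. (1)-(4)."""
    n, p, m = len(x), len(v[0]), len(w[0])
    z_in = [v[0][j] + sum(x[i] * v[i + 1][j] for i in range(n)) for j in range(p)]
    z = [f(a) for a in z_in]
    y_in = [w[0][k] + sum(z[j] * w[j + 1][k] for j in range(p)) for k in range(m)]
    y = [f(a) for a in y_in]
    return z_in, z, y_in, y


def corrections(x, t, v, w, alpha):
    """Steps 6-7, Eqs. (5)-(11): correction terms laid out like v and w."""
    z_in, z, y_in, y = feedforward(x, v, w)
    n, p, m = len(x), len(z), len(y)
    delta_k = [(t[k] - y[k]) * f_prime(y_in[k]) for k in range(m)]          # Eq. (5)
    delta_in = [sum(delta_k[k] * w[j + 1][k] for k in range(m)) for j in range(p)]  # Eq. (8)
    delta_j = [delta_in[j] * f_prime(z_in[j]) for j in range(p)]           # Eq. (9)
    dw = [[alpha * d for d in delta_k]]                                     # Eq. (7)
    dw += [[alpha * delta_k[k] * z[j] for k in range(m)] for j in range(p)]  # Eq. (6)
    dv = [[alpha * d for d in delta_j]]                                     # Eq. (11)
    dv += [[alpha * delta_j[j] * x[i] for j in range(p)] for i in range(n)]  # Eq. (10)
    return dv, dw


def train(data, n, p, m, alpha=0.5, tol=1e-4, max_epochs=5000, seed=0):
    """Steps 1, 2, 8, and 9. tol and max_epochs are assumed values."""
    rng = random.Random(seed)
    v, w = init_weights(n, p, m, rng)
    for _ in range(max_epochs):                      # Step 1
        largest_change = 0.0
        for x, t in data:                            # Step 2
            dv, dw = corrections(x, t, v, w, alpha)
            for mat, dmat in ((v, dv), (w, dw)):     # Step 8, Eqs. (12)-(13)
                for r, row in enumerate(dmat):
                    for c, d in enumerate(row):
                        mat[r][c] += d
                        largest_change = max(largest_change, abs(d))
        if largest_change < tol:                     # Step 9
            break
    return v, w


# Toy dataset (XOR), used here purely for illustration.
xor = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
       ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
v, w = train(xor, n=2, p=3, m=1)
for x, t in xor:
    print(x, t, feedforward(x, v, w)[3])
```

Note that the correction terms of Eqs. (6) and (10) are the negative gradient of the squared error $E = \frac{1}{2}\sum_k (t_k - y_k)^2$ scaled by $\alpha$, so the loop is plain per-pattern gradient descent. In practice, frameworks such as Keras or PyTorch compute these terms automatically.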




3 Illustrative Example 


3.1 Dataset

In order to train a neural network to predict the constituent pathway
class(es) for a given metabolite, we obtained a dataset of 6669
metabolites from Baranwal et al. [20]. This dataset was assembled
from the Kyoto Encyclopedia of Genes and Genomes (KEGG)
database [29]. Each metabolite is labeled as belonging to one or
more classes, summarized in Table 1.
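Because a metabolite may belong to several classes at once, the training target is naturally a vector with one 0/1 entry per class. The following sketch shows one way such multi-label targets could be encoded and decoded. The function names, the 1-based class IDs, the 0.5 threshold, and the example metabolites are all hypothetical and are not taken from the dataset.

```python
NUM_CLASSES = 11  # number of KEGG pathway classes listed in Table 1


def multi_hot(class_ids, num_classes=NUM_CLASSES):
    """Encode class IDs (assumed to be 1-based) as a 0/1 target vector."""
    target = [0] * num_classes
    for cid in class_ids:
        if not 1 <= cid <= num_classes:
            raise ValueError(f"class ID {cid} outside 1..{num_classes}")
        target[cid - 1] = 1
    return target


def decode(probabilities, threshold=0.5):
    """Turn per-class model outputs into predicted class IDs (threshold assumed)."""
    return [i + 1 for i, p in enumerate(probabilities) if p >= threshold]


# Hypothetical label assignments, for illustration only.
labels = {"metabolite_A": [1], "metabolite_B": [3, 9]}
for name, ids in labels.items():
    print(name, multi_hot(ids))
```

With this encoding, each output neuron of the network can be read as an independent yes/no decision for one pathway class, which is what distinguishes a multi-label problem from a multi-class one.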


3.2 Dataset Preparation and Feature Engineering

The dataset consisting of the metabolites with their classes is
passed through a preprocessing procedure as shown in Fig. 4.
This preparation process is crucial for building effective
machine learning models. It consists of the following three steps:


1. Standardization: The metabolites may originate from different
sources. This implies that the molecules themselves may be
represented in different ways (e.g., with/without salts, different
protonation states, different tautomeric states, etc.). Molecules
in the dataset should be represented in a standard
manner; otherwise, the same molecular entity may give rise to
different representations (and descriptors used in our models).
This step also removes molecules in our dataset that are of little
interest (e.g., single-atom entries).

2. Clustering: Some of the metabolites may be similar to each
other and would artificially inflate the performance of the
machine learning models. For example, two metabolites, one
in the training and the other in the testing set, having the same
classes, may differ (only) by a methyl group. The machine learning
algorithm would be able to easily classify the testing set molecule,
since it has been trained on an almost identical molecule.
In virtual screening, this is referred to as analogue bias
[30]. Clustering is performed to remove these similar molecules
(as well as identical ones). Only a representative molecule
from each cluster is used as input to the machine learning
models.

3. Descriptor Generation: There are many ways in which a metabolite
may be represented. This could be a vector of physicochemical
properties (e.g., molecular weight, hydrogen bond
donors, hydrogen bond acceptors, etc.), the topology of a
molecule (e.g., the graph or fingerprint representation), 3D
descriptors which use the shape (conformation) of a molecule
for representation, and other higher-dimensional descriptors
(e.g., 3D + charge). The representation used may have a severe
impact on the accuracy of the model. A suitable representation
is typically selected via experimentation.

We implement these standardization, clustering, and descriptor
generation steps using RDKit [31], a popular, free, and open-source
cheminformatics toolkit.

Table 1
List of KEGG pathway database classes

Class ID   Class name
1          Carbohydrate metabolism
2          Energy metabolism
3          Lipid metabolism
4          Nucleotide metabolism
5          Amino acid metabolism
6          Metabolism of other amino acids
7          Glycan biosynthesis and metabolism
8          Metabolism of cofactors and vitamins
9          Metabolism of terpenoids and polyketides
10         Biosynthesis of other secondary metabolites
11         Xenobiotic biodegradation and metabolism

Fig. 4 Data preprocessing procedure (metabolites, dataset preparation, descriptor generation, model building)

3.2.1. Standardization


The main aim of our standardization process is twofold: (i) to enforce consistency across our dataset and results by representing all molecules in a uniform way and (ii) to make sure all molecules used to train and test the models are of high quality and error-free (e.g., free of incorrect valences). We start by loading the molecules from a SMILES file [32], and we sanitize each molecule using RDKit. Among other functionality, this step includes the following:


1. Corrects a number of nonstandard valence states (e.g., N(=O)=O -> [N+](=O)[O-]).
2. Calculates and checks explicit and implicit valences on all atoms.
3. Converts aromatic rings to their Kekulé form. Errors are raised if a ring cannot be kekulized or if aromatic bonds are found outside rings.
4. Identifies aromatic rings and ring systems (sets bond orders to aromatic).
5. Identifies conjugation in a molecule.
6. Removes chiral tags from atoms that are not sp3 hybridized.
7. Adds explicit hydrogen atoms where needed to preserve chemistry (e.g., in heteroatoms in aromatic rings).


The output of this sanitization is a list of molecules which are consistent with each other and may be read into RDKit for processing. Each valid molecule is then passed through the standardization process, which comprises the following steps:

1. Remove hydrogen atoms.
2. Disconnect metal atoms.
3. If many fragments are present, take the largest fragment.
4. Normalize functional groups.
5. Neutralize charges on the molecule, and then reionize common functional groups in a standard way (note that RDKit does not have a pKa calculator and no attempt is made to ionize the molecule at a particular pH).
6. Canonicalize the tautomeric representation.


The original dataset contains a number of molecules with dummy atoms (denoted with * in SMILES). We replace these dummy atoms with a hydrogen atom. We also remove all molecules with fewer than five atoms.


3.2.2 Clustering Analysis


Clustering allows us to take representative molecules from our dataset. This ensures that the performance of our models is evaluated in a more realistic way. Our clustering removes similar (and identical) molecules from the dataset, which makes the performance evaluation more objective, since the testing set will not contain molecules that are trivially similar to those in the training set. In this study, we use Butina clustering [33] on the small molecules and take the centroid of each cluster while discarding the other members of each cluster. This reduced our dataset from 6669 to 2171 molecules. There are three steps to clustering analysis using Butina:


1. Generation of fingerprints 
2. Identification of potential cluster centroids 


3. Clustering based on exclusion spheres 
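The three steps listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the chapter's RDKit-based implementation: each fingerprint is represented here as a set of on-bit indices (a hypothetical stand-in for a 1024-bit Morgan fingerprint), and the function names `tanimoto` and `butina_cluster`, as well as the toy data, are invented for this example.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def butina_cluster(fingerprints, threshold):
    """Cluster fingerprints; the first member of each cluster is its centroid."""
    n = len(fingerprints)

    # Identify potential centroids: count the neighbors of every molecule.
    neighbors = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if tanimoto(fingerprints[i], fingerprints[j]) >= threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)

    # Sort by descending neighbor count; ties are broken by index so that
    # the result is the same on every run.
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))

    # Exclusion spheres: a molecule already placed in a cluster can neither
    # join another cluster nor act as a centroid.
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        members = [centroid] + sorted(neighbors[centroid] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Toy fingerprints (hypothetical): two groups of close analogues and one loner.
toy = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {10, 11, 12}, {10, 11, 13}, {20, 21}]
clusters = butina_cluster(toy, threshold=0.4)
representatives = [members[0] for members in clusters]
print(clusters)
print(representatives)
```

Keeping only `representatives` mirrors the idea of retaining one molecule per cluster; the last toy molecule ends up as a singleton because nothing else is similar to it.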


Generation of fingerprints. In the original work by Darko Butina, a 1024-bit fingerprint was generated using Daylight. To make the work reproducible using open-source software, we decided to generate 1024-bit Morgan fingerprints with a radius of 2 (roughly equivalent to ECFP4) using RDKit.


Identification of potential cluster centroids. The idea is that the molecules in a cluster with the largest number of neighbors (i.e., similar molecules) are most representative of the cluster, as they are most like the other members of their group. To compute the potential centroids, a similarity threshold is chosen (in our case 0.4). This threshold is then used to compute neighbors in the set (anything more similar than this threshold is considered a neighbor). A list of the molecules in the set, sorted by descending number of neighbors, is maintained. This is required for the algorithm to be deterministic (i.e., to give the same results every time it is run).


Clustering based on exclusion spheres. Starting from the first molecule in the list (i.e., the one with the most neighbors), calculate its similarity to all other molecules in the set in a pairwise fashion. All those molecules with a similarity equal to or higher than the selected threshold become members of the same cluster (with the original molecule from the sorted list being the centroid of the cluster). This is known as an exclusion sphere (on the known cluster). This set of similar molecules forming a cluster is now ignored and cannot form part of another cluster (or act as a centroid). If a molecule in the sorted list has no similar neighbors (either all molecules have a similarity lower than the chosen threshold or the similar molecules have been earlier assigned to another cluster), then it forms a singleton (a cluster with a single member). The Butina algorithm is known to generate many homogeneous clusters.


3.2.3 Descriptor Generation


In order to build supervised machine learning models, we need to be able to represent molecules in some computer-readable way. These representations, or “descriptors,” describe molecular properties using a set of numerical or categorical variables. There are many possible descriptors (e.g., Ultrafast Shape Recognition uses the 3D shape of a molecule to generate a vector of 12 real numbers [33, 34]), and different types of descriptors may affect the performance of our models. A common approach is to use a list (or vector) of numbers as a fingerprint to represent a molecule. Fingerprints may either be binary in nature, containing a zero or one in every position recording the absence or presence of a feature in a molecule, or else have numerical representations (such as counts of a particular chemical moiety). In this work, we use two different fingerprint descriptors: extended-connectivity circular fingerprint (ECFP) [35] and molecular access system (MACCS) [36]. ECFP assigns each non-hydrogen atom in the molecule an identifier based on six atomic properties (valence, number of immediate non-hydrogen atoms, etc.). A radius parameter (2 in our case) is specified during the fingerprint generation, which defines the atomic neighborhood to consider in an iterative manner. Each iteration captures a larger atomic neighborhood for each atom. These environments are then hashed into a fixed-length binary list which records their presence. The length of our ECFP fingerprint is 1024 bits. MACCS fingerprints are composed of an ordered binary list of 166 structural keys (e.g., does the molecule contain a Cl atom?). We generate these two fingerprints for each of our 2171 metabolites to use as input to our machine learning models.


3.3 Model Setup and Training


As the inputs to the model consist of two binary vectors of lengths 166 and 1024, respectively, for each metabolite, feature normalization is not required.

We then randomly split the dataset into 80% (1737 samples) training and 20% (434 samples) test data using a stratified split approach, which seeks to ensure that each class is represented in the same ratio in the training and test sets. This was achieved using the iterative_train_test_split function within the skmultilearn [37] Python package.

The Keras API library [26] was used to set up two neural 
network architectures (one for each type of binary fingerprint vec- 
tor) which could predict the constituent pathway class(es) for a 
given metabolite. The number of neurons in the input and output 
layer is fixed by the dimensionality of the input features and the 
output classes, while the number of hidden layers and the number 


Table 2
List of neural network hyperparameters and their ranges

Hyperparameter                                      Values
Number of hidden layers and neurons in each layer   (128, 64), (256, 64), (256, 128, 64), (512, 128), (512, 256, 128)
Learning rate                                       three values of the form 1 × 10^n [exponents illegible in the source]
Activation function                                 ReLU, tanh, sigmoid
Optimizer                                           Adam, stochastic gradient descent, RMSProp


of neurons in each hidden layer form part of the set of hyperparameters for the models. As opposed to the model parameters (or weights), which are learned during the training process, hyperparameters are tunable parameters which are used to control the training process. They are usually established by rule of thumb or else through optimization procedures, which seek to determine the set of hyperparameter values that enables the model to achieve the best performance. Other neural network hyperparameters include the learning rate, the optimizer to be used, and the activation function.
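Because the layer sizes fix how many trainable weights a fully connected network has, the count can be checked with simple arithmetic. The sketch below assumes plain dense layers and takes the layer sizes 167, 256, 64, and 11 as read from Fig. 5 for the MACCS network; the helper name `count_dense_parameters` is invented for this illustration.

```python
def count_dense_parameters(layer_sizes):
    """Return (weights, biases) for a stack of fully connected layers."""
    # Each pair of consecutive layers contributes an (inputs x outputs) weight matrix.
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    # Every neuron after the input layer has one bias term.
    biases = sum(layer_sizes[1:])
    return weights, biases


# Assumed layer sizes: input, two hidden layers, output (as labeled in Fig. 5).
maccs_layers = [167, 256, 64, 11]
weights, biases = count_dense_parameters(maccs_layers)
print(weights, biases, weights + biases)
```

Under these assumptions the count comes to 59,840 weights (60,171 parameters including biases), in line with the "around 60,000 weights" quoted for the MACCS network.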


Hyperparameter optimization was carried out via grid search. With this technique, a scan is performed over all combinations of the candidate hyperparameter values, with the best set of hyperparameters being determined via the per-class accuracy as explained in Subheading 3.4. A list of the hyperparameters and their corresponding ranges is shown in Table 2.
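A grid search of this kind amounts to enumerating the Cartesian product of the candidate values and keeping the best-scoring combination. The following is a minimal sketch, not the chapter's code: the learning-rate values are placeholders chosen for illustration, and `toy_score` is a hypothetical stand-in for the cross-validated model performance that a real search would compute by training a network for each configuration.

```python
from itertools import product

grid = {
    "hidden_layers": [(128, 64), (256, 64), (256, 128, 64), (512, 128), (512, 256, 128)],
    "learning_rate": [1e-3, 1e-4, 1e-5],  # placeholder values (assumption)
    "activation": ["relu", "tanh", "sigmoid"],
    "optimizer": ["adam", "sgd", "rmsprop"],
}


def iter_configurations(grid):
    """Yield every combination of hyperparameter values as a dictionary."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, score_fn):
    """Return the configuration with the highest score and that score."""
    best_config, best_score = None, float("-inf")
    for config in iter_configurations(grid):
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_score(config):
    """Hypothetical scoring function used only to make the sketch runnable."""
    score = 1.0 if config["optimizer"] == "adam" else 0.5
    score += 0.1 if config["activation"] == "relu" else 0.0
    score -= abs(sum(config["hidden_layers"]) - 320) / 1000
    return score


n_configurations = sum(1 for _ in iter_configurations(grid))
best_config, best_score = grid_search(grid, toy_score)
print(n_configurations)
print(best_config)
```

With five architectures and three candidate values for each of the other hyperparameters, the grid contains 5 × 3 × 3 × 3 = 135 configurations, each of which would require its own training runs.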


K-fold cross-validation (with K = 5) was also performed in conjunction with hyperparameter optimization. Holding out a single fixed validation set means that part of the dataset is never used for training, which may pose a problem of underfitting. Therefore, in K-fold cross-validation, the training dataset is partitioned into K equally sized subsets (which contain samples drawn randomly); in each of K rounds, one subset is held out, and the model is trained on the remaining data and evaluated on the held-out subset. The performance is then computed over the K subsets, which gives a better estimate of how well the model generalizes.
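The partitioning can be sketched as follows. This is a simplified illustration that shuffles sample indices and deals them into K folds; it does not stratify by class, and the function name `k_fold_indices` and the fixed seed are assumptions made for the example. The sample count of 1737 is the training-set size quoted above.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (training, validation) index lists for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # samples are drawn randomly
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        training = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield training, validation


splits = list(k_fold_indices(n_samples=1737, k=5))
for training, validation in splits:
    print(len(training), len(validation))
```

Every sample is used for validation exactly once and for training in the other four rounds, which is what allows the whole training set to contribute to both fitting and evaluation.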


Diagrams of the architectures used are shown in Figs. 5 and 6, showing the final neural network structures which gave the best performance. A learning rate of $1 \times 10^{-4}$ was used. The Rectified Linear Unit (ReLU) activation function was used throughout the network except for the final output layer, for which the sigmoid function was used. This is necessary in order to obtain a multi-label output, since each output neuron then yields an independent probability for its class. The binary cross-entropy (or log-loss) loss function was used, which is given by:

$$H_p(q) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \cdot \log\left(p(y_i)\right) + (1 - y_i) \cdot \log\left(1 - p(y_i)\right) \right] \tag{14}$$


Fig. 5 Neural network architecture for the MACCS dataset resulting in a total of around 60,000 weights (layer sizes labeled in the figure: 167, 256, 64, 11)

Fig. 6 Neural network architecture for the Morgan dataset, resulting in a total of around 1.64 million weights (layer sizes labeled in the figure: 1024, 512, 128, 11)


where N is the number of samples, $y_i$ is the label of sample i, and $p(y_i)$ is the predicted probability that this label is 1. Finally, the Adam optimization algorithm [38] was used to update the network weights. This algorithm combines the advantages of two other optimization algorithms which in turn are extensions of stochastic gradient descent, namely, the adaptive gradient (AdaGrad) algorithm and root mean squared propagation (RMSProp).
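To make the loss function concrete, the binary cross-entropy of Eq. (14) can be evaluated directly for a handful of labels. The sketch below is a plain-Python illustration with invented example values; the small clipping constant `eps` is an assumption added only to avoid taking the logarithm of zero.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]  # mostly correct, confident predictions
uncertain = [0.5, 0.5, 0.5, 0.5]  # a model that always outputs 0.5

loss_confident = binary_cross_entropy(labels, confident)
loss_uncertain = binary_cross_entropy(labels, uncertain)
print(loss_confident, loss_uncertain)
```

A model that always predicts 0.5 incurs a loss of log 2 (about 0.693) per label, while confident correct predictions drive the loss toward zero.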


3.4 Model Performance


The performance of the trained neural networks was evaluated 
using the 20% unseen testing dataset. As the predictor is a multi- 
label one, we cannot compute a global accuracy, but instead we 
obtain results on a per-class basis. Apart from the accuracy, which 
represents the fraction of correct predictions made by the model, 
the precision (the fraction of relevant instances among the retrieved 
instances) and recall (the fraction of relevant instances that were 
retrieved) are also computed: 


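$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{16}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{17}$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.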


Figure 7 shows the per-class accuracy, precision, and recall 
obtained on the MACCS features. The model generally achieves a 
good accuracy; however, it performs less well with regard to preci- 
sion and recall, in particular for classes 2 and 7. 

Fig. 7 Per-class accuracy, precision, and recall obtained on the unseen testing set for MACCS distinct (bar chart; x-axis: Class; series: Accuracy, Precision, Recall)

An explanation of the poorer performance in terms of precision and recall can be found in Fig. 8, which shows how another metric, the F1-score, varies with the number of samples (i.e., number of instances when the prediction should be “1”) for each class.

Fig. 8 Linear fit applied to the F1-score obtained on the unseen testing set for MACCS distinct as a function of the number of “1” labels per class (x-axis: Number of samples; y-axis: F1-score)

The F1-score combines both precision and recall as follows:

$$\text{F1-score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{18}$$
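The per-class metrics can be computed by counting true and false positives and negatives separately for each label column. The sketch below uses a small invented multi-label example with two classes; the function name `per_class_metrics` and the convention of returning 0 when a ratio is undefined are assumptions made for this illustration.

```python
def per_class_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for each column of 0/1 label matrices."""
    results = []
    for c in range(len(y_true[0])):
        tp = fp = tn = fn = 0
        for true_row, pred_row in zip(y_true, y_pred):
            t, p = true_row[c], pred_row[c]
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 0 and p == 0:
                tn += 1
            else:
                fn += 1
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        # Undefined ratios (no predicted or no actual positives) are reported as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results.append({"accuracy": accuracy, "precision": precision,
                        "recall": recall, "f1": f1})
    return results


# Toy data (hypothetical): five samples, two classes; class 1 has one positive.
y_true = [[1, 0], [1, 0], [0, 0], [0, 1], [0, 0]]
y_pred = [[1, 0], [0, 0], [0, 0], [0, 0], [1, 0]]
metrics = per_class_metrics(y_true, y_pred)
for class_id, m in enumerate(metrics):
    print(class_id, m)
```

In this toy case the second class reaches an accuracy of 0.8 even though its only positive sample is missed, so its recall and F1-score are zero; this is one way in which a class with few positive labels can show good accuracy alongside poor precision and recall.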

Informally, the Hamming loss is the fraction of labels that are incorrectly predicted. The Hamming loss is formally defined as:

$$HL = \frac{1}{NL} \sum_{i=1}^{N} \sum_{j=1}^{L} X_{i,j} \oplus Y_{i,j} \tag{19}$$

where $X_{i,j}$ is the target, $Y_{i,j}$ is the prediction, $\oplus$ denotes the exclusive-or operator, which returns 0 when the target and prediction are identical and 1 otherwise, L is the number of labels, and N is the number of samples. As this is a loss function, the optimal value is zero. A Hamming loss of 0.0667 was obtained for the MACCS features.
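As a quick illustration of this definition, the Hamming loss can be computed by counting mismatched label positions with the exclusive-or operator. The example matrices below are invented for the sketch and are not taken from the chapter's dataset.

```python
def hamming_loss(targets, predictions):
    """Fraction of label positions where target and prediction differ."""
    n_samples = len(targets)
    n_labels = len(targets[0])
    mismatches = sum(
        x ^ y  # exclusive or: 1 when the two bits differ, 0 otherwise
        for target_row, prediction_row in zip(targets, predictions)
        for x, y in zip(target_row, prediction_row)
    )
    return mismatches / (n_samples * n_labels)


# Toy data (hypothetical): five samples, two labels each.
targets = [[1, 0], [1, 0], [0, 0], [0, 1], [0, 0]]
predictions = [[1, 0], [0, 0], [0, 0], [0, 0], [1, 0]]
loss = hamming_loss(targets, predictions)
print(loss)
```

Three of the ten label positions disagree here, giving a Hamming loss of 0.3; identical matrices give the optimal value of zero.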

Visual demonstrations of the model performance can be obtained through receiver operating characteristic (ROC) and precision-recall (PR) curves, shown in Figs. 9 and 10, respectively. An ROC curve is obtained by plotting the true-positive rate as a function of the false-positive rate, as the classifier’s discrimination threshold is varied. A good model obtains a high true-positive rate for a correspondingly low false-positive rate, resulting in curves which pass close to the top left-hand corner of the plot. A random classifier would result in the dotted red line, with the true-positive rate equal to the false-positive rate. The area under the curve (AUC) is another metric that can be obtained from ROC curves, with a perfect classifier having an AUC of 1.
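The construction of an ROC curve and its AUC can be sketched for a single class as follows. This is a plain-Python illustration with invented scores; it assumes that both positive and negative samples are present, and the function names `roc_curve` and `area_under_curve` are chosen for this example rather than taken from any particular library.

```python
def roc_curve(y_true, scores):
    """Return (fpr, tpr) lists obtained by lowering the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives  # assumes both classes are present
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / negatives)
        tpr.append(tp / positives)
    return fpr, tpr


def area_under_curve(x, y):
    """Trapezoidal area under a curve given by matching x and y lists."""
    return sum((x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2 for i in range(len(x) - 1))


# Toy data (hypothetical): one negative sample is ranked above one positive.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr = roc_curve(y_true, scores)
auc = area_under_curve(fpr, tpr)
print(auc)
```

For these toy scores, eight of the nine positive-negative pairs are ranked correctly, so the AUC is 8/9; ranking every positive above every negative would give an AUC of 1.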


Fig. 9 Per-class receiver operating characteristic curves showing the area under the curve (AUC) obtained (x-axis: False Positive Rate; y-axis: True Positive Rate). Legend: No skill; Class 1: AUC = 0.92; Class 2: AUC = 0.93; Class 3: AUC = 0.95; Class 4: AUC = 0.93; Class 5: AUC = 0.89; Class 6: AUC = 0.87; Class 7: AUC = 0.94; Class 8: AUC = 0.92; Class 9: AUC = 0.96; Class 10: AUC = 0.89; Class 11: AUC = 0.93


Fig. 10 Per-class precision recall curves showing the area under the curve (AUC) obtained (x-axis: Recall; y-axis: Precision). Legend: No skill; Class 1: AUC = 0.71; Class 2: AUC = 0.34; Class 3: AUC = 0.61; Class 4: AUC = 0.43; Class 5: AUC = 0.61; Class 6: AUC = 0.48; Class 7: AUC = 0.05; Class 8: AUC = 0.73; Class 9: AUC = 0.81; Class 10: AUC = 0.82; Class 11: AUC = 0.89


On the other hand, a PR curve is generated by plotting the precision as a function of the recall. In this case, the ideal curve passes through the top right-hand corner. A random (no-skill) classifier would result in a horizontal line at a precision equal to the fraction of positive samples in the class, whatever the recall.

Figures 11 and 12 show the per-class accuracy, precision, and recall obtained for the Morgan features, as well as the F1-score varying with the number of samples. A Hamming loss of 0.0766 was obtained. The per-class ROC and PR curves are shown in Figs. 13 and 14, respectively.


Fig. 11 Per-class accuracy, precision, and recall obtained on the unseen testing set for Morgan distinct (bar chart; x-axis: Class)


Fig. 12 Linear fit applied to the F1-score obtained on the unseen testing set for Morgan distinct as a function of the number of “1” labels per class (x-axis: Number of samples; y-axis: F1-score)


Fig. 13 Per-class receiver operating characteristic curves showing the area under the curve (AUC) obtained (x-axis: False Positive Rate; y-axis: True Positive Rate). Legend: Class 1: AUC = 0.89; Class 2: AUC = 0.93; Class 3: AUC = 0.90; Class 4: AUC = 0.92; Class 5: AUC = 0.86; Class 6: AUC = 0.73; Class 7: AUC = 0.91; Class 8: AUC = 0.93; Class 9: AUC = 0.96; Class 10: AUC = 0.91; Class 11: AUC = 0.95


Fig. 14 Per-class precision recall curves showing the area under the curve (AUC) obtained (x-axis: Recall; y-axis: Precision). Legend: No skill; Class 1: AUC = 0.47; Class 2: AUC = 0.33; Class 3: AUC = 0.72; Class 4: AUC = 0.50; Class 5: AUC = 0.60; Class 6: AUC = 0.20; Class 7: AUC = 0.33; Class 8: AUC = 0.77; Class 9: AUC = 0.89; Class 10: AUC = 0.81; Class 11: AUC = 0.89


4 Conclusion 


Understanding the mechanisms and structural mappings between molecules and pathway classes is an important step toward the design of reaction predictors for synthesizing new molecules. In this chapter, we provided an in-depth look at machine learning and neural networks and demonstrated how these techniques can be applied to the problem of predicting the association of metabolites to their respective metabolomic pathway classes. The dataset preparation procedure, including standardization, clustering, and descriptor generation, is presented, and we also discuss a number of metrics and methods which can be used to evaluate the performance of the model. The Python code for this chapter is made available at https://github.com/jp-um/machine_learning_for_metabolomic_pathway_analyses.


References 


1. Nicholson JK, Lindon JC (2008) Systems biology: metabonomics. Nature 455:1054
2. Fiehn O (2002) Metabolomics - the link between genotypes and phenotypes. Plant Mol Biol 48:155
3. Holmes E, Wilson ID, Nicholson JK (2008) Metabolic phenotyping in health and disease. Cell 134:714
4. Vermeersch KA, Styczynski MP (2013) Applications of metabolomics in cancer research. J Carcinog 12:9
5. Kraj A, Drabik A, Silberring (2010) Nowe podejscie w oznaczaniu i identyfikacji mikroorganizmow (Polish). Wydawnictwa Uniwersytetu Warszawskiego, Warszawa, pp 1-4, 15-18
6. Bu Q, Huang YN, Yan GY, Cen XB, Zhao YL (2012) Metabolomics: a revolution for novel cancer marker identification. Comb Chem High Throughput Screen 15:266
7. Spratlin JL, Serkova NJ, Eckhardt SG (2009) Clinical applications of metabolomics in oncology: a review. Clin Cancer Res 15:431
8. Gika HG, Theodoridis GA, Plumb RS, Wilson ID (2014) Current practice of liquid chromatography-mass spectrometry in metabolomics and metabonomics. J Pharm Biomed Anal 87:12
9. Blekherman G, Laubenbacher R, Cortes DF, Mendes P, Torti FM et al (2011) Bioinformatics tools for cancer metabolomics. Metabolomics 7:329
10. Ellis DI, Dunn WB, Griffin JL, Allwood JW, Goodacre R (2007) Metabolic fingerprinting as a diagnostic tool. Pharmacogenomics 8:1243
11. Drexler DM, Reily MD, Shipkova PA (2011) Metabolomics guides rational development of a simplified cell culture medium for drug screening against Trypanosoma brucei. Anal Bioanal Chem 399:2645
12. Schuhmacher R, Krska R, Weckwerth W, Goodacre R (2013) Metabolomics and metabolite profiling. Anal Bioanal Chem 405:5003
13. McCulloch W, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 5:115-133
14. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436
15. Cambiaghi A, Ferrario M, Masseroli M (2016) Analysis of metabolomic data: tools, current strategies and future challenges for omics data integration. Brief Bioinform 18(3):498-510
16. Smith R, Ventura D, Prince JT (2013) LC-MS alignment in theory and practice: a comprehensive algorithmic review. Brief Bioinform 16(1):104-117
17. Alonso A, Marsal S, Julia A (2015) Analytical methods in untargeted metabolomics: state of the art in 2015. Front Bioeng Biotechnol 3:23
18. Nguyen DH, Nguyen CH, Mamitsuka H (2018) Recent advances and prospects of computational methods for metabolite identification: a review with emphasis on machine learning approaches. Brief Bioinform 20(6):2028-2043
19. Puchades-Carrasco L, Palomino-Schatzlein M, Perez-Rambla C et al (2015) Bioinformatics tools for the analysis of NMR metabolomics studies focused on the identification of clinically relevant biomarkers. Brief Bioinform 17(3):541-552
20. Baranwal M, Magner A, Elvati P et al (2020) A deep learning architecture for metabolomic pathway prediction. Bioinformatics 36(8):2547-2553
21. Pomyen Y, Wanichthanarak K, Poungsombat P et al (2020) Deep metabolome: applications of deep learning in metabolomics. Comput Struct Biotechnol J 18:2818-2825
22. Chollet F (2017) Deep learning with Python. Manning Publications Co
23. Chollet F, Allaire JJ (2018) Deep learning with R. Manning Publications Co
24. Kim P (2017) MATLAB deep learning: with machine learning, neural networks and artificial intelligence. Apress
25. Abadi M, Barham P, Chen J et al (2016) Tensorflow: a system for large-scale machine learning. Proc. 12th USENIX Symposium on Operating Systems Design and Implementation
26. Chollet F. Keras. https://keras.io. Accessed 6th Jan 2022
27. Paszke A, Gross S, Massa F et al (2019) PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inform Proc Syst 32:8024-8035
28. Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825-2830
29. KEGG Pathway Database. Available at: https://www.genome.jp/kegg/pathway.html. Accessed 6th Jan 2022
30. Good AC, Oprea TI (2008) Optimization of CAMD techniques 3. Virtual screening enrichment studies: a help or hindrance in tool selection? J Comput-aided Mol Des 22:169-178
31. RDKit: Open-source cheminformatics, available at https://www.rdkit.org. Accessed 6th Jan 2022
32. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28(1):31-36
33. Butina D (1999) Unsupervised data base clustering based on Daylight's fingerprint and Tanimoto similarity: a fast and automated way to cluster small and large data sets. J Chem Inf Comput Sci 39(4):747-750
34. Ballester PJ, Richards WG (2007) Ultrafast shape recognition for similarity search in molecular databases. Proc R Soc A Math Phys Eng Sci 463:1307-1321
35. Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50(5):742-754
36. Durant JL, Leland BA, Henry DR, Nourse JG (2002) Reoptimization of MDL keys for use in drug discovery. J Chem Inform Comp Sci 42:1273-1280
37. Szymanski P, Kajdanowicz T (2017) A scikit-based Python environment for performing multi-label classification. arXiv:1702.01460
38. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. arXiv:1412.6980




Machine Learning and Hybrid Methods for Metabolic 
Pathway Modeling 


Miroslava Cuperlovic-Culf, Thao Nguyen-Tran, and Steffany A. L. Bennett 


Abstract 


Computational cell metabolism models seek to provide metabolic explanations of cell behavior under 
different conditions or following genetic alterations, help in the optimization of in vitro cell growth 
environments, or predict cellular behavior in vivo and in vitro. In the extremes, mechanistic models can 
include highly detailed descriptions of a small number of metabolic reactions or an approximate represen- 
tation of an entire metabolic network. To date, all mechanistic models have required details of individual 
metabolic reactions, either kinetic parameters or metabolic flux, as well as information about extracellular 
and intracellular metabolite concentrations. Despite the extensive efforts and the increasing availability of 
high-quality data, required in vivo data are not available for the majority of known metabolic reactions; 
thus, mechanistic models are based primarily on ex vivo kinetic measurements and limited flux information. 
Machine learning approaches provide an alternative for derivation of functional dependencies from existing 
data. The increasing availability of metabolomic and lipidomic data, with growing feature coverage as well as 
sample set size, is expected to provide new data options needed for derivation of machine learning models of 
cell metabolic processes. Moreover, machine learning analysis of longitudinal data can lead to predictive 
models of cell behaviors over time. Conversely, machine learning models trained on steady-state data can 
provide descriptive models for the comparison of metabolic states in different environments or disease 
conditions. Additionally, inclusion of metabolic network knowledge in these analyses can further help in the 
development of models with limited data. 

This chapter will explore the application of machine learning to the modeling of cell metabolism. We first 
provide a theoretical explanation of several machine learning and hybrid mechanistic machine learning 
methods currently being explored to model metabolism. Next, we introduce several avenues for improving 
these models with machine learning. Finally, we provide protocols for specific examples of the utilization of 
machine learning in the development of predictive cell metabolism models using metabolomic data. We 
describe data preprocessing, approaches for training of machine learning models for both descriptive and 
predictive models, and the utilization of these models in synthetic and systems biology. Detailed protocols 
provide a list of software tools and libraries used for these applications, step-by-step modeling protocols, 
troubleshooting, as well as an overview of existing limitations to these approaches. 


Key words Metabolism modeling, Hybrid modeling, Metabolomics, Lipidomics, Flux analysis, 
Machine learning 


Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology, 
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_18, 
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023 


1 Introduction


Computer modeling of cell metabolism provides an in silico platform to test optimal culture conditions, interventions, or the impact of target engagement, whether for the production of biologics or bioremediation in metabolic engineering, for understanding different metabolic states under physiological and disease conditions to identify new therapeutic targets, or for predictive modeling of cell behavior in a changing environment. Such models have been used to advantage in multiple biopharmaceutical applications [1], drug target identification [2], toxicogenomics, including comparison of animal and human cell responses [3], and, as detailed kinetic models, in simple cell systems, including red blood cells (erythrocytes) [4] and platelets [5]. These models can be further expanded into major biotechnology platforms designed to optimize the engineering of CHO cells for biologics [6] and HEK293 cells for vaccine particle production [7] and to characterize the metabolic changes that influence pluripotency and stem cell fate [8].

Classical, mechanistic, cell metabolism models, generally, are 
either dynamic models that include detailed kinetic information for 
a limited number of reactions or steady-state, constrained models 
that simulate stationary behavior of a larger cellular, tissue, or 
organismal system [9]. These models are built based on biological 
knowledge and only for known metabolic reactions where subsets 
of reaction or flux parameters are optimized using data to fit specific 
conditions. Kinetic models allow dynamic simulation of the change 
in the system over time; constrained models assume the system is in 
steady-state, thus, only allowing simulation of the flux through 
reactions with the assumption of constant metabolite concentra- 
tions on the simulation timescale. When choosing between these 
extremes, the modeler is faced with a trade-off between the size of 
the model and the level of detail provided by the predicted 
solutions. 

Different combinations of methods have been proposed to 
model metabolism including efforts to develop a genome-scale 
kinetic model combining large network coverage with detailed 
reaction and metabolite concentration analysis (reviewed in 
[10, 11]). Bringing together different types of mechanistic models, 
however, attempts to alleviate the deficits of constraint-based mod- 
els given their lack of information about dynamic metabolite con- 
centration and enzyme regulation while optimizing the kinetic 
framework to reduce shortcomings associated with nonlinearity, 
parameter identifiability, and uncertainty. Although these com- 
bined approaches can bring metabolism modeling closer to the 
optimal large scale, they fully depend on a priori biological knowl- 
edge. Moreover, the reality is that they will encompass multiple 
unknown parameters that require optimization or testing for 


Fig. 1 Examples of possible contributions of ML in different steps of mechanistic model development. Although ML can be used for building the complete model or for combining model and experimental data, it can also help in the determination of parameters and the optimization of specific steps in the development and utilization of mechanistic models. NLP natural language processing, ODE ordinary differential equations


outcome across many combinations and large ranges of values. Such hybrid models are, by their very nature, both computationally demanding and data-intensive. The application of machine learning (ML) methods to these models can address some of these issues. ML enables various types of data to be used simultaneously and offers data-driven approaches that can provide more efficient parameter searches and more accurate, unbiased modeling. Thus, ML can both contribute to specific steps in mechanistic model development, as outlined in Fig. 1, and present new global approaches for the expansion of hybrid methods combining both constrained and kinetic modeling, as described below.

ML comprises sets of algorithms that develop predictive models
through experience, i.e., through learning and functional generalization
from data. ML models can be developed from data alone
and do not require any prior knowledge; however, they also benefit
from the inclusion of domain knowledge, which can tailor the ML
methodology to specific applications. In this way, prior knowledge can
reduce training data needs. ML methods can also contribute to
individual steps in the development of a mechanistic model (Fig. 1 
shows some possible applications). In this context, ML is not used 
in modeling but helps to gather information, optimize parameters, 
or provide better solvers for differential equations [12]. Alterna- 
tively, ML can be further integrated into mechanistic models to 


420 Miroslava Cuperlovic-Culf et al. 


provide analysis of the results or to include theoretical information for
the development of knowledge-inspired metabolic ML models. Combining
ML and mechanistic methods into “hybrid” cell metabolism models can
therefore augment the mechanistic knowledge about metabolism and the
kinetics of metabolic reactions with data-driven methods for describing
unknown parts of the system or for describing, more effectively, the
underlying complexity of the system. These hybrid models are a very
recent innovation, with great potential to provide new insight into
metabolism and its influence on organism and cellular fate.

1.1 From Mechanistic to ML Models, There, and Back Again

The first hybrid model in systems biology was presented in 2010
[13], yet this potentially transformative approach remains in its
infancy due to the complexity of the problem and the lack of
appropriate data for most applications. In the most general case,
mathematical modeling attempts to combine internal and external
metabolic reactions and interactions, with the ultimate goal of
simulating the complete metabolic network in all its detail, including
complete metabolic pathways and individual reactions as well as
activation and inhibition, in a formal, numerical representation that
is as accurate and detailed as current information allows. For
well-described systems and reactions, it is possible to develop highly
accurate mechanistic models that present detailed dynamic reaction
information and describe the change in metabolism over time via
differential equations, which can include the effects of inhibitors
or activators of enzyme function. The increasing availability of
longitudinal omics data will allow optimization of kinetic parameters
in these models. However, for the majority of reactions,
this level of information is not available, and modeling is only
possible by approximating the kinetic process (e.g., with the
Michaelis-Menten equation) or by assuming steady state and
constraining potential responses, which makes it possible to model
a larger number of reactions. Hybrid models
employing ML have been fueled by the increasing availability of
large amounts of biomolecular data. ML models increase calculation
speed, but, even more importantly, ML can assist in creating
models for systems for which there is limited knowledge via a
data-driven approach. ML methods can furthermore be used to combine
data from different sources including multiomic data, enhance
mechanistic models by providing additional in silico data, and
optimize methods for parameter determination. ML methods can also
help in building and executing simulations to test outcomes.

Kinetic models of metabolism integrate enzyme regulation and
multiomic data with reaction network information to provide
dynamic analyses and predictions of metabolite concentrations.
These models give a mechanistic representation of the processes
in cells, defined as a series of ordinary differential equations (ODEs),
and include details of rate expressions and kinetic parameters to
estimate the dynamic behavior of each reaction in the model. The
mathematical form of the model is shown in Eq. 1:

$$\frac{dc_i}{dt} = S\,V(E, C, \varepsilon), \quad i = 1, 2, \ldots, n \tag{1}$$

where $c_i$ is the concentration of metabolite $i$, $S$ is the stoichiometric
matrix, and $V$ is the vector of reaction fluxes, which
depends on $E$ (the enzyme abundance), $C$ (metabolite concentrations),
and $\varepsilon$ (kinetic parameters for the reaction). Equations are
written for each metabolite in the system, requiring knowledge of
appropriate parameters for each and every reaction. The sets of
ODEs are then often solved using approximate rate laws,
for example, Michaelis-Menten or Hill kinetic equations
[14]. As a result, the majority of kinetic models focus on a
small subset of reactions within specific pathways. Kinetic models 
have been developed for a number of metabolic pathways in differ- 
ent organisms and are made available through dedicated reposi- 
tories (listed in Table 1). While useful and effective, the possibility 
to develop large, genome-scale kinetic models remains challenging 
given issues of kinetic model nonlinearity, computational tractabil- 
ity, parameter identifiability, estimability, and uncertainty [10]. 
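
As an illustration of Eq. 1, the following minimal sketch integrates a toy two-step pathway (A to B to C) with Michaelis-Menten rate laws. The pathway, the parameter values, and all function names are assumptions made for this example and are not taken from any model discussed in this chapter; a hand-written fourth-order Runge-Kutta step stands in for a production ODE solver.

```python
# Toy pathway A -> B -> C. Rows of S are metabolites (A, B, C),
# columns are reactions (r1, r2). All values are illustrative.
S = [[-1, 0],
     [1, -1],
     [0, 1]]


def fluxes(conc, enzyme, params):
    """Flux vector V(E, C, eps) with Michaelis-Menten kinetics."""
    a, b, _ = conc
    (kcat1, km1), (kcat2, km2) = params
    v1 = kcat1 * enzyme[0] * a / (km1 + a)
    v2 = kcat2 * enzyme[1] * b / (km2 + b)
    return [v1, v2]


def dcdt(conc, enzyme, params):
    """Right-hand side of Eq. 1: dc/dt = S V."""
    v = fluxes(conc, enzyme, params)
    return [sum(S[i][j] * v[j] for j in range(len(v))) for i in range(len(S))]


def rk4_step(conc, dt, enzyme, params):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return [c + h * s for c, s in zip(base, slope)]

    k1 = dcdt(conc, enzyme, params)
    k2 = dcdt(shifted(conc, k1, dt / 2), enzyme, params)
    k3 = dcdt(shifted(conc, k2, dt / 2), enzyme, params)
    k4 = dcdt(shifted(conc, k3, dt), enzyme, params)
    return [c + dt / 6 * (a + 2 * b + 2 * d + e)
            for c, a, b, d, e in zip(conc, k1, k2, k3, k4)]


def simulate(c0, enzyme, params, dt=0.01, n_steps=1000):
    """Integrate the pathway and return the concentration trajectory."""
    traj = [list(c0)]
    for _ in range(n_steps):
        traj.append(rk4_step(traj[-1], dt, enzyme, params))
    return traj


trajectory = simulate([1.0, 0.0, 0.0], enzyme=[1.0, 1.0],
                      params=[(1.0, 0.5), (0.8, 0.3)])
final = trajectory[-1]
```

Because each column of S sums to zero in this toy network, total mass is conserved along the trajectory, which gives a quick sanity check on the integration.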

While kinetic information is available for a number of enzymes 
in several detailed databases [15, 16] (reviewed in Table 1), the 
majority of kinetic constants have been measured ex vivo. Without 
empirical validation, it is possible that they inadequately represent 
the in vivo situation. More accurate determination of kinetic para- 
meters requires optimization from data; however, models generally 
have problems in identifying and optimizing large numbers of 
parameters given nonlinear mechanistic rate equations. Simplified 
kinetic models have been explored for different applications by 
either reducing the size of the pathway space or simplifying kinetic 
equations. Such approaches require optimization of these approximate
parameters for each case. Methods such as approximate Bayesian
computation (ABC) [17] have been proposed to improve the fitting of
models to data: ABC samples parameter values from an
approximation of the posterior distribution without explicitly
calculating the likelihood function.
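
The idea behind ABC can be sketched with its simplest variant, rejection sampling. The example below is an assumption-laden illustration rather than the procedure used in [17]: the mechanistic model is a hypothetical first-order decay, the prior is uniform, and the tolerance and distance function are arbitrary choices.

```python
import math
import random


def model(k, times, c0=1.0):
    """Hypothetical mechanistic model: first-order decay c(t) = c0 * exp(-k t)."""
    return [c0 * math.exp(-k * t) for t in times]


def distance(simulated, observed):
    """Root-mean-square difference between simulated and observed values."""
    return math.sqrt(sum((s - o) ** 2 for s, o in zip(simulated, observed))
                     / len(observed))


def abc_rejection(observed, times, n_draws=20000, tolerance=0.05, seed=0):
    """Keep prior draws whose simulation lies within `tolerance` of the data."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        k = rng.uniform(0.0, 2.0)  # assumed uniform prior on the rate constant
        if distance(model(k, times), observed) <= tolerance:
            accepted.append(k)  # no likelihood is evaluated at any point
    return accepted


times = [0.5 * i for i in range(11)]
noise = random.Random(1)
observed = [c + noise.gauss(0.0, 0.02) for c in model(0.7, times)]

posterior_sample = abc_rejection(observed, times)
posterior_mean = sum(posterior_sample) / len(posterior_sample)
```

The accepted draws approximate the posterior distribution of the rate constant; tightening the tolerance trades acceptance rate for accuracy.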

The alternative to kinetic models, constraint-based modeling, 
lacks the representation of metabolite concentration and enzyme 
regulation afforded by kinetic models. Instead, these so-called 
genome-scale metabolic models (GEMs) combine gene sequence 
information with omics data to provide a map of intracellular 
metabolism for an organism through calculation of the stoichio- 
metric matrix. GEMs have been used for a number of different 
applications, for example, flux balance analysis (FBA) [18] or met- 
abolic balance analysis (MBA) [19] as well as testing of synthetic 


Table 1
Examples of resources available for model development, ML examples, as well as metabolic models

Metabolism model development | Software application | Examples of applications in cell culture metabolomics
Bayesian modeling | GRASP [17] | Methionine cycle modeling using approximate Bayesian computation [17]
Logical modeling | CellNetOptimizer (http://www.cellnopt.org); GINsim (http://ginsim.org) | Combination of cell line proteomics and metabolomics data logic mechanistic modeling to explain heterogeneous drug response in cellular cholesterol regulation [62]
Dynamic modeling through ordinary differential equations | COPASI [63]; CellDesigner [64]; VCell [65] | Many examples of COPASI’s use in biotechnology cell modeling are reviewed in [66]; recent example of hybrid cybernetic modeling that combines dynamic modeling between different metabolic states for CHO cells [67]
Stochastic modeling | COPASI [63]; StochKit [68]; MaBoSS (http://maboss.curie.fr) | Theoretical foundation to study metabolism in conjunction with stochastic enzyme expression has been presented showing metabolic heterogeneity resulting from enzyme-level stochasticity [69]
Stoichiometric modeling | COBRA [57]; CobraPy [57, 70]; Raven 2.0 [58]; Merlin [71] | Genome-scale stoichiometric reconstructions and computational models of mammalian metabolism particularly for CHO cells coupled to protein secretion [72]
Agent-based modeling | ARCADE [73] | Extensive review of agent-based methods for cancer cell modeling [37]

ML tools | Software application | Examples of some application in cell culture metabolomics
Longitudinal GPR (LonGP) | https://github.com/chengl7/LonGP [40] | Additive GPR method for non-parametric analysis of longitudinal data
LSTM used in metabolism modeling | https://github.com/youlab/pattern_prediction_NN_Shangying [37] | LSTM for improvement of parameter modeling based on mechanistic models

Metabolism model database | Software application | Type of resource
BioModels | https://www.ebi.ac.uk/biomodels [74] | Model repository
SABIO-RK | http://sabio.h-its.org/ [75] | Kinetic information
BRENDA | https://www.brenda-enzymes.org/ [16] | Kinetic information
eQuilibrator | https://equilibrator.weizmann.ac.il/ [76] | Database of biochemical equilibrium constants and Gibbs free energies



lethality of genes [20] and determination of off-target drug effects 
[21]. GEMs are built on a network connection of all metabolic 
reactions that are known to occur in an organism combining meta- 
bolites, genes, and protein information to inform observed changes 
in metabolite concentrations across conditions. 

The potential to determine and work from the entire metabolic 
reaction network derived directly from genome information opens 
an opportunity for building complete metabolic maps for any 
organism as well as subsets of metabolic networks for different 
biological systems. GEMs can simulate flux for all known metabo- 
lites. Additionally, they can provide a platform for multiomic anal- 
ysis as well as a system for an evaluation of the complete 
metabolome space with sparse metabolomic profiling data. How- 
ever, their reaction maps are often underdetermined, with more 
reactions than metabolites; thus, they generate many possible solu- 
tions often too complex for the majority of applications [1]. A 
number of approaches to address this issue and simplify these 
models for specific applications include the utilization of transcrip- 
tomic, proteomic, and metabolomic data to remove unlikely reac- 
tions as well as the addition of biological, physical, or chemical 
constraints [22-24]. Gene expression data are commonly used to
extract the subset of reactions that are active in a specific situation
and to silence reactions catalyzed by enzymes that are not expressed.
Although this approach is efficient, it makes the very strong assumption
that gene expression measured at a given time point in a
mixture of cells is linked to the gene-protein-reaction network at steady
state. This assumption oversimplifies the highly complex
relationship between proteins, metabolite fluxes, and gene
expression. As an example, the most complete GEM for the metabolism
of human cells, Recon3D, provides a network of 10,600
reactions linking 5835 metabolites and 2248 genes [25]. Recon3D
provides very good coverage of hydrophilic metabolites; however,
while it includes a number of lipid pathways, its coverage of the
lipidome is incomplete, making it difficult to extend
beyond metabolomics.
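
The expression-based contextualization described above can be sketched as follows. This is a simplified illustration under stated assumptions: the four-reaction network, the expression values, the threshold, and the function names are all hypothetical, and each reaction is treated as active if any one of its associated genes is expressed (isoenzyme, OR logic), which ignores enzyme complexes.

```python
# Hypothetical toy network: reaction -> genes whose products catalyze it.
reaction_genes = {
    "HEX1": ["HK1", "HK2"],
    "PGI": ["GPI"],
    "PFK": ["PFKM", "PFKL"],
    "LDH": ["LDHA"],
}

# Hypothetical expression values (arbitrary units).
expression = {"HK1": 35.2, "HK2": 0.4, "GPI": 80.1,
              "PFKM": 0.2, "PFKL": 0.9, "LDHA": 120.0}

# Hypothetical default flux bounds (lower, upper) for each reaction.
bounds = {"HEX1": (0.0, 1000.0), "PGI": (-1000.0, 1000.0),
          "PFK": (0.0, 1000.0), "LDH": (0.0, 1000.0)}


def active_reactions(reaction_genes, expression, threshold=1.0):
    """Mark a reaction active if at least one associated gene is expressed."""
    return {rxn: any(expression.get(g, 0.0) >= threshold for g in genes)
            for rxn, genes in reaction_genes.items()}


def constrain_bounds(bounds, active):
    """Silence inactive reactions by setting both flux bounds to zero."""
    return {rxn: (lb, ub) if active[rxn] else (0.0, 0.0)
            for rxn, (lb, ub) in bounds.items()}


active = active_reactions(reaction_genes, expression)
constrained = constrain_bounds(bounds, active)
```

In this toy case only PFK is silenced, because neither of its associated genes passes the threshold. The choice of threshold directly encodes the assumption criticized in the text, namely that measured transcript abundance reflects enzyme activity at steady state.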

The lack of network solutions for lipidomic data makes lipidomics
highly amenable to data-driven modeling. Development of
mechanistic kinetic models of lipid metabolism, or a complete
representation of lipid processes via GEMs, remains highly challenging
due to the diversity of lipid functions and their enzymes. As classified
by the LIPID MAPS consortium [26], lipids are divided into
eight categories and further subdivided into multiple classes, subclasses,
divisions, and molecular species, each with specific roles and
synthesized or remodeled by overlapping enzymatic pathways. Current
estimates of the number of lipid species in biological life range
from 9000 to 100,000 [27]. This diversity in lipid structures and
functions makes the mapping of all interconnections of lipids
impossible as of today. In addition, the enzymes that regulate
lipids are promiscuous, catalyzing several different reactions with
different specificities for the hydrocarbon chains that define lipid
identities [28]. Without detailed substrate affinities, it is difficult to
predict which lipids at the molecular level will be impacted by a
change in condition or state. As a further challenge to all metabolomic
modeling, cellular reactions are compartmentalized, with
enzymes localizing to specific organelles within cells and to specific
tissues within an organism. Thus, modeling must consider not only
lipid abundances and enzymatic function but also their transport
and, ideally, their subcellular concentrations. As an example, acid
ceramidase, encoded by ASAH1, localizes to the lysosome and catalyzes
the hydrolysis of ceramides to their constituent sphingoid
base and free fatty acid at pH = 4.5. If the enzyme is mislocalized or
lysosomal pH is alkalinized, then acid ceramidase catalyzes the
reverse reaction, increasing the abundance of ceramides from a
sphingoid base and a free fatty acid [29, 30]. Under physiological
conditions, acid ceramidase displays substrate preference for ceramides
and free fatty acids with unsaturated N-acyl hydrocarbon
chains of 6-16 carbons [29].

1.2 Improving Cell Metabolism Modeling with ML

ML methods can be viewed as a combination of algorithms that
learn and generalize functional dependencies from experience, i.e.,
data, to identify high-order correlations and then generate predictions
from data. At the most basic level, ML methods can be
divided into two approaches: unsupervised and supervised. Unsupervised
methods aim to determine variation, correlations, groups,
or functional dependencies among samples without any input of
sample labels from an external “supervisor” [31]. Supervised methods,
on the other hand, rely on the supplied sample labels and try to
develop models that predict targets and underlie the supervised
group classification. Regression analysis is part of supervised ML,
where algorithms are trained with input and output features to
provide predictive modeling of a continuous outcome (e.g., metabolite
concentration over time) based on the value of one or more
predictors, i.e., input values, system parameters, or condition
characteristics.
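
As a minimal illustration of supervised regression on a continuous outcome, the sketch below fits a straight line to log-transformed metabolite concentrations over time by ordinary least squares. The data are synthetic and noise-free, and the first-order decay they follow is an assumption made only for this example.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic training data: time (input feature) and concentration (output).
times = [0.0, 1.0, 2.0, 3.0, 4.0]
conc = [2.0 * math.exp(-0.4 * t) for t in times]

# Train on log concentration so that exponential decay becomes linear.
slope, intercept = fit_line(times, [math.log(c) for c in conc])


def predict(t):
    """Predicted concentration at time t from the fitted model."""
    return math.exp(intercept + slope * t)


predicted_at_5 = predict(5.0)
```

The fitted slope recovers the decay rate used to generate the data, and the model can then predict concentrations at time points that were not part of the training set.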

Specific roles of ML in combination with mechanistic metabo- 
lism modeling are: 


1. Integration of in silico mechanistic modeling results with other 
omics data. 

2. Determination of parameters for mechanistic models from 
data- or theory-driven ML. 


We review example methods that have been applied with suc- 
cess below and then provide specific methodology protocols. 


1.2.1 Integration of in Silico Mechanistic Modeling Results with Other Omics Data

To achieve integration, the user must first develop and optimize a
mechanistic model and then use the data obtained from this model
for ML analysis of the system. ML system exploration can use the
results of the simulation or combine simulation outputs with other
relevant data about the system under investigation. As a proof of
principle, a combination of ML and multiomics data was used to
effectively predict pathway dynamics in [32, 33]. In this approach,
metabolism models can be built at any scale, from whole-network
GEMs to very small models, including successful recapitulation
of lipid metabolism (reviewed in [34]). Here, ML is subsequently
used as a tool for data mining rather than modeling. A
small number of examples combining GEM and ML methods
have shown the potential of both supervised and unsupervised
ML for this type of application. As an example, when used
for analysis of the effect of inhibitors on metabolism, GEMs can
simulate flux differences following disruption of a
specific metabolic step. In this approach, ML can be used to determine
major changes across the network between control and in
silico “treated” cases. Shaked et al. [35] used support vector
machine (SVM) and random forest (RF) ML methods to determine
major metabolic alterations from simulated flux data obtained
using flux variability analysis (FVA) following inhibitory drug simulation
through gene deletion analysis. In this way, ML was used to
determine drug side effects on the metabolic network [35]. In
another very significant application, GEM and ML models were
combined during learning tasks by embedding stoichiometric constraints
in the ML model training process [36]. In this approach,
dynamic elementary mode regression discriminant analysis was
developed to identify the most discriminant pathway activation
patterns between different conditions [36].

1.2.2 Determination of Parameters for Mechanistic Models from Data- or Theory-Driven ML

Mechanistic models require optimization of parameters from data
where, in the majority of cases, models cannot be solved analytically;
thus, parameter optimization requires numerical methods.
These methods are often slow and, for a large number of parameters
with an exponentially increasing number of combinations,
unable to perform large-scale explorations of the complete parameter
space. Yet the complete parameter space must be interrogated
in order to determine globally optimal parameter or input choices.
Long short-term memory (LSTM), a deep learning-based network
analysis method, has shown promising results for accelerating
this parameter optimization with high accuracy [37]. LSTM was
introduced as a way to resolve the problems of exploding/vanishing
gradients that recurrent or very deep neural networks face when
trying to learn long-term dependencies [38]. LSTM has been
developed for processing continuous series of data [39], including
time course sequences (as is usually the case in mechanistic models)
or series of outcomes for combinations of input parameters
(as needed for optimization of a model routine). The strength of
these deep learning methods lies in their capacity to establish a map
of outcomes from the training data. In the LSTM application, a
small subset of data generated using mechanistic models is used to
train a neural network that then provides faster coverage of the
parameter space to determine the optimal combination for a given
system.

A very detailed outline of LSTM methodology, with examples
of LSTM architectures used for metabolism modeling, is provided in
[37, 38]. In an LSTM, the memory cell remembers, i.e., holds,
values over some time or point intervals, and the gates control
and regulate the flow of information into the cell. An LSTM is ultimately
built from a set of recurrently connected subnetworks where
each block maintains its state and regulates information flow
through its nonlinear gating units. In the applications reviewed in
[37, 38], LSTM is used to determine mechanistic model input
parameters, as it was able to search through a larger space of parameter
options using a relatively small training set of random parameters
and the molecular outputs predicted by the mechanistic model. LSTM
networks were shown to provide reliable and, most importantly,
novel patterns of parameters, suggesting that they are not limited to
passive repetition of the training information but provide a real
mapping between input and output parameters. In this approach,
neural network model building focuses on an empirical mapping of
combinations of input parameters to system outputs of interest and
provides a much faster way to search the input parameter space while, at
the same time, providing very accurate models for output parameters.
For exploration, a vanilla LSTM is readily available in Python
or MATLAB applications.
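
The roles of the memory cell and the gates can be made concrete with a forward pass through a single vanilla LSTM block. The sketch below is illustrative only: the weights are random rather than trained, the layer sizes and input sequence are arbitrary, and the helper names are assumptions for this example and do not reproduce the architectures of [37, 38].

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, U, b, x, h):
    """Compute W x + U h + b for one gate using plain lists."""
    return [sum(w * xi for w, xi in zip(W[r], x))
            + sum(u * hi for u, hi in zip(U[r], h))
            + b[r] for r in range(len(b))]


def init_params(n_in, n_hidden, seed=0):
    """Random, untrained weights (W, U, b) for each of the four gate units."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {gate: (mat(n_hidden, n_in), mat(n_hidden, n_hidden), [0.0] * n_hidden)
            for gate in ("forget", "input", "output", "candidate")}


def lstm_step(x, h_prev, c_prev, p):
    """One time step: gates decide what the cell forgets, stores, and emits."""
    f = [sigmoid(z) for z in affine(*p["forget"], x, h_prev)]
    i = [sigmoid(z) for z in affine(*p["input"], x, h_prev)]
    o = [sigmoid(z) for z in affine(*p["output"], x, h_prev)]
    g = [math.tanh(z) for z in affine(*p["candidate"], x, h_prev)]
    # The cell state holds values across steps, updated under gate control.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def run_sequence(sequence, n_hidden, params):
    """Feed a sequence through the block, starting from a zero state."""
    h, c = [0.0] * n_hidden, [0.0] * n_hidden
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
    return h, c


params = init_params(n_in=2, n_hidden=3)
sequence = [[1.0, 0.5], [0.8, 0.6], [0.6, 0.7], [0.5, 0.75]]
h_final, c_final = run_sequence(sequence, 3, params)
```

In a parameter-search application, each element of the sequence would be a time point or parameter combination produced by the mechanistic model, and the weights would be learned by a deep learning library rather than drawn at random.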

An alternative approach to training ML models with data and 
mechanistic models is to use biological knowledge to develop more 
appropriate ML models that can then be trained with smaller 
datasets providing knowledge-constrained modeling. Gaussian 
process regression (GPR) is a method of great interest in this type 
of application. In GPR, analysis and modeling of time-series data 
and the determination of parameters and models can be viewed as a 
regression problem where the goal of inference is to determine the 
putative form of the time-dependent function and to obtain the 
probability distribution of the dependent value on the variable. In 
the sense of metabolism modeling, regression problems would take 
the form of $c(t) = f(x(t)) + \epsilon$. This functional dependence
determination can be viewed as curve fitting that assumes that $c_i$ is
ordered by $t_i$, where $c_i$ is a function of time. GPR models can
provide nonlinear system modeling, can be trained with smaller 
datasets, and can automatically output values that include the vari- 
ance and confidence interval of the model. In addition, prior 
knowledge can be incorporated into the GPR model before train- 
ing through optimization of covariance and kernel function. Here, 


[Figure panel labels. Analytical measurement: biological samples with perturbation or treatment; MS-based analysis, untargeted (Orbitrap, TOF) and targeted (QqQ: precursor ion selection, fragmentation chamber, fragment ion selection); NMR-based analysis (1D and 2D NMR spectra; liquid, solid, or gas sample; output FID). Data processing, MS data: baseline subtraction, peak detection, peak integration, RT alignment, adduct annotation, deisotoping. Data processing, NMR data: apodization, Fourier transform, phasing, baseline correction, peak assignment, peak quantification. Normalization and quality control: normalization to unit of input material, normalization to reference sample, correction for extraction efficiency and instrument variability, batch correction. ML for metabolism.]


Fig. 2 Brief outline of two approaches linking mechanistic and machine learning (ML) models for (a) using ML 
for combined analysis of simulation results and omics data and (b) using ML for increased parameter space 
search coverage in order to increase 


kernels can be viewed as flexible nonlinear functions that can be 
optimized and developed to define how quickly the regression 
function will vary. A related example of utilization of GPR in 
modeling of longitudinal processes was recently presented in [40]. 
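
A minimal GPR sketch for a metabolite time course is given below. It is not the LonGP method of [40]: it assumes a zero-mean prior, a squared-exponential kernel, a fixed length scale and noise variance (no hyperparameter optimization), and synthetic noise-free data. It does show the two outputs highlighted above, a predicted mean and a predictive variance.

```python
import math


def rbf(t1, t2, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel; length_scale sets how quickly f can vary."""
    return signal_var * math.exp(-0.5 * ((t1 - t2) / length_scale) ** 2)


def cholesky(A):
    """Lower-triangular L with A = L L^T for a positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Forward substitution for L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def solve_lower_transposed(L, y):
    """Back substitution for L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gpr_predict(t_train, c_train, t_new, noise_var=1e-4, **kernel_args):
    """Posterior mean and variance of c(t) at the new time points."""
    K = [[rbf(a, b, **kernel_args) + (noise_var if i == j else 0.0)
          for j, b in enumerate(t_train)] for i, a in enumerate(t_train)]
    L = cholesky(K)
    alpha = solve_lower_transposed(L, solve_lower(L, c_train))
    means, variances = [], []
    for t in t_new:
        k_star = [rbf(t, a, **kernel_args) for a in t_train]
        v = solve_lower(L, k_star)
        means.append(sum(k * a for k, a in zip(k_star, alpha)))
        variances.append(rbf(t, t, **kernel_args) - sum(x * x for x in v))
    return means, variances


t_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
c_train = [math.exp(-0.5 * t) for t in t_train]  # synthetic time course
means, variances = gpr_predict(t_train, c_train, [2.5, 10.0], length_scale=1.0)
```

The prediction at t = 2.5 lies between measured points and has low variance, whereas the prediction at t = 10, far from any measurement, reverts toward the prior and reports high variance. Prior knowledge enters through the kernel choice and its parameters.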

Although many different ML approaches can be combined
with mechanistic modeling in a variety of ways and for a range of
applications, a number of similar procedural steps are required for
the application of any ML method, whether for analysis of model-derived
data or for augmentation of mechanistic models. The Methods
section lists procedures for the use of LSTM and GPR in modeling;
similar protocols are required for other ML models. The
Materials section below provides some software tools and links to
major metabolism modeling databases. The Methods section below
provides detailed protocols, with Fig. 2 giving a schematic presentation
of these procedures.


2 Materials

Information about Web resources providing data, information, and
software for metabolic modeling that can support ML and hybrid
model development is presented in Table 1.

3 Methods

3.1 Using Mechanistic Models to Produce Data for Incorporation into ML Classifiers

Development of a high-quality model relies upon (1) intimate
knowledge of the system in question, (2) the articulation of appropriate
hypotheses to test the models using experimental data, and
(3) a feedback workflow to inform the model for rebuilding and
validation. The experimental data used for modeling should be
obtained using robust, high-throughput analytical techniques
that allow for rapid identification and reliable quantification of
metabolites. In this context, metabolomic and lipidomic datasets
are predominantly generated by mass spectrometry (MS)-based
and nuclear magnetic resonance (NMR) approaches. A brief outline
of the methods is shown in Fig. 3.

3.1.1 MS-Based Lipidomic and Metabolomic Data

MS offers a sensitive, quantitative technical solution and includes
the possibility of devising and coupling experiments to produce
structural information on countless metabolites in a single acquisition.
Considerations of data processing are as follows:
[Figure panel labels. (A) ML for combined analysis of mechanistic model and data: data, mechanistic model, model optimization, model-generated data, additional data, ML analysis. (B) Faster search of parameter space through ML: generate random parameter set (media concentrations, kinetic parameters, ...) as input to the mechanistic model; mechanistic model output (concentrations, fluxes, ...); ML model training and test set; validated model; faster search of the parameter space.]


Fig. 3 Schematic representation of NMR- and MS-based metabolomics and lipidomics analysis providing data 
for model development. Included are major steps going from sample preparation, analytical methodologies, 
assignment, and data preprocessing 




1. Untargeted MS analyses provide an unbiased approach to 
simultaneously measure a large number of metabolites or lipids 
within a sample without prior knowledge of lipid and metabo- 
lite categories. Strengths are the broad coverage afforded by 
the high-resolution mass analyzers used to discriminate lipids 
based on mass to charge (m/z). Weaknesses lie in the complex- 
ity of the matrices analyzed such that high abundance metabo- 
lites are favored over low abundance ones despite multiple 
front-end separation approaches (e.g., gas chromatography, liquid
chromatography, ion mobility). Quantification is
semiquantitative. Without a reduction in matrix complexity,
the large quantity of metabolites and lipids results in
ion suppression due to co-elution, as well as in detector satura- 
tion. These limitations are offset by the high-resolution mass 
scanning of the precursor ion which enables identification 
based on m/z. A comprehensive review of the technologies is 
provided in [41, 42]. 

2. Targeted MS analyses focus on a predefined set of metabolites
and lipids by parking on a diagnostic ion using triple quadrupole
or QTRAP mass analyzers wherein the third quadrupole
can be switched to trap fragmented ions for structural verification
(reviewed in [41, 42]). By coupling chromatography to
targeted MS methods, higher-resolution and more reliable
quantification of metabolites can be achieved. In addition to
GC with derivatization and a variety of LC methods such as normal
phase, reversed phase, and hydrophilic interaction LC, ion pair
chromatography is another strategy commonly employed in
metabolomic analysis for the separation of ionic metabolites
[41, 42]. The targeted metabolomic and lipidomic pipelines
generally utilize tandem mass spectrometry to obtain high
selectivity, enhanced sensitivity, and reliable quantification of
metabolic targets by reducing noise from isobaric species. As
such, targeted MS analyses aim to come close to absolute
quantification. This is achieved by performing tandem MS
experiments such as multiple reaction monitoring (MRM,
with or without scheduling) to restrict analysis to a predefined
set of metabolites or lipids. This approach reduces data complexity by
quantifying a single lipid or metabolite subclass at a time (i.e.,
exploring 1000 in lieu of ~10,000 metabolites at a time).
The limitation is the number of analyses required to explore
the entire lipidome/metabolome. It is important to note that
data from both untargeted and targeted approaches complement
metabolomic modeling approaches.


3. Post-acquisition data processing in both MS approaches
involves noise filtering and baseline correction, peak detection/selection,
adduct annotation and deisotoping, peak alignment,
and further deconvolution if necessary. Typically, in
untargeted MS analyses, due to the broad coverage of
metabolites, the mass spectrum and chromatogram are
saturated with noise signals. The removal of these noise signals
involves establishing a set threshold and subtracting this
threshold from the measurement. Similarly, this type of analysis
will likely also detect isotopic peaks of metabolites,
which need to be removed to simplify the final dataset.
For both untargeted and targeted MS analyses, specific parameters
such as Gaussian smoothing, peak splitting, acceptable
peak width, and retention time windows must be established
for peak picking. This ensures consistency in data analysis and
avoids false-positive signals. Finally, peak alignment is an
important step in post-acquisition data processing to obtain
correct identity assignment for each MS signal. Peak alignment
and annotation are often performed using multiple peak features
that depend on the separation methodology employed. Several
alignment programs and algorithms have been developed for
this purpose [43-47].

4. For post-acquisition normalization, the MS signal
corresponding to each monitored metabolite or lipid, whether
obtained in untargeted or targeted approaches, is normalized
against an internal standard, critically of the same class as the
analyte, and either expressed as pmol equivalents of this standard
or placed onto the standard curve of a known, normalized
standard. Following this quantification from the sample
extract, the normalized MS signals need to be expressed
according to the amount of starting biological material (e.g.,
liquid volume, cell number, tissue wet weight).

3.1.2 NMR-Based Data

NMR can be used for nondestructive, continual, or in vivo measurements
in biofluids, tissues, and intact tissues and in solid,
semisolid, and gas phases, with a variety of different experiments
and instrument profiles and measurements of multiple different
nuclei (e.g., ¹H, ¹⁵N, ¹³C, ³¹P), separately or simultaneously. In
terms of metabolism modeling, NMR can provide longitudinal
measurement for a system by either continual sampling or in vivo
NMR measurement. Sample acquisition is limited, with NMR
experiments monitoring between 50 and 200 metabolites of high
abundance (with concentrations greater than 1 μM). Briefly, steps
in data derivation using NMR are as follows:


1. It is essential to select the appropriate experiment for the
system of interest. For fast, high-throughput, or continual
sample monitoring and quantification, 1D experiments
with water suppression (e.g., 1D NOESY or 1D CPMG) are preferred;
they require minimal sample preprocessing (in the basic case
only the addition of NMR reference material and pH
buffer), while 2D NMR makes it possible to analyze
complex systems with unknown metabolites. Sample preparation
for different applications is reviewed in great detail
elsewhere [48].




2. Data processing for any NMR experiment involves signal
processing (apodization, Fourier transform, phasing) and normalization
(relative to the NMR reference). The resulting spectrum
provides both peak positions (in ppm), which can be used for
assignment, and peak intensities, which are directly related to the
analytes’ concentrations. With the addition of an internal reference,
NMR can be used for absolute quantification of metabolites in
the sample and for comparison between different samples or time
points.


3. Metabolite assignment is performed against reference standards as
described in [49-51], with a number of methods available for
different sample types. Important considerations are that peak
positions shift with sample properties (i.e., pH, osmolality) and
that line widths change with magnetic field strength, sample
viscosity, and composition, which can alter peak overlaps and lead to
errors in assignment. Thus, assignment and quantification should be
based on information for comparable systems; specific assignment and
quantification methods are available, for example, for human blood or
cerebrospinal fluid [52]. Several general methods are also available,
but before using them, the user should adjust the parameters for the
specific sample set (reviewed recently in [53]).
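The absolute quantification against an internal reference mentioned in step 2 can be illustrated with a minimal sketch. It assumes the usual proportionality between integrated peak area and the number of contributing nuclei; the function name, the argument names, and the numerical values are hypothetical and are not taken from the protocol.

```python
def concentration_from_peak(peak_area, n_nuclei, ref_area, ref_n_nuclei, ref_conc):
    """Estimate an analyte concentration from an integrated NMR peak.

    Each area is divided by the number of equivalent nuclei that give rise
    to the peak, so that the ratio of the two terms reflects molar amounts.
    The result has the same unit as ref_conc.
    """
    if peak_area < 0 or ref_area <= 0 or n_nuclei <= 0 or ref_n_nuclei <= 0:
        raise ValueError("areas and nuclei counts must be positive")
    return (peak_area / n_nuclei) / (ref_area / ref_n_nuclei) * ref_conc


# Hypothetical values: a methyl peak (3 protons) quantified against a
# reference singlet arising from 9 protons at a known 0.5 mM.
analyte_mM = concentration_from_peak(
    peak_area=12.0, n_nuclei=3, ref_area=9.0, ref_n_nuclei=9, ref_conc=0.5
)
```

In practice, overlapping peaks and relaxation effects mean that dedicated profiling tools, such as those cited above, are preferable to a single-peak ratio.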


3.2 Prepare Omics Data for Further Model Development

A number of preprocessing steps are universally required for the
development of mechanistic models regardless of the modeling
approach and omics data collected. These include:


1. Data assignment and quantification. 


2. High-quality, relevant longitudinal data, taken either from new
experiments or from published databases, are required to build the
model and optimize its parameters. For metabolism modeling, it is
essential to have assigned and quantified features measured for the
specific biological system under the conditions of interest.
Genomics, transcriptomics, and/or proteomics data should be used for
contextualization of genome-scale models, and metabolomics/lipidomics
or flux data are used for parameter determination in kinetic models
or network optimization in GENs. Kinetic parameters are available for
many enzymatic reactions from ex vivo measurements (Table 1).


3. Missing data imputation: Due to biological or technical rea- 
sons, some features will remain unidentified or unquantified. 
Depending on the cause for missing data, analysts should fol- 
low different strategies. Features with a large number of miss- 
ing values across conditions (of the order of 20-30% missing 
values) should be excluded from further analysis. Features with 
low abundance, or undetected in specific samples where values fall
below levels of detection, can be imputed with a value that is a
fraction of the lowest measurable value for the species (using 1/4 or
1% of the lowest measured value for that feature) or set to 0. Values
missing due to experimental or technical errors can be imputed using
computational methods that calculate the missing values from measured
values in other samples determined to be similar. Extensive
benchmarking of imputation methods has been presented recently [54],
showing that in the majority of tests random forest-based imputation
provides an excellent approach for estimating missing data.


4. Data scaling from different experimental platforms. As a variety
of data sources can be used in the development of a metabolic model,
it is crucial to perform appropriate normalization for each data
type, using either standard or internal references or relative
feature levels, before combining data for model building. The analyst
must also decide whether low- and high-abundance analytes are placed
on the same scale to ensure equal representation. Methods have been
discussed in great detail previously [55, 56].


3.3 Develop a Mechanistic Model of Metabolic Processes of Interest


For the network of interest, first develop a set of ODEs or PDEs
describing all reactions of interest in the model, with appropriate
dependencies and sink points, in the format of Eq. 1. For large
systems, an exact solution is not possible, and generally one of two
approaches is applied: (1) make a quasi-steady-state assumption and
reduce the problem to the genome-scale model (2.b), or (2) use
mathematical functions to describe the V(E, c, k) function, applying
available, measured, or estimated values for the parameters (2.c):


1. For genome-scale model development, omics data provided for the
system of interest (e.g., genomics, transcriptomics, proteomics,
metabolomics, lipidomics) are used to develop a personalized
genome-scale FBA model. In particular, gene transcription and gene
mutation information are integrated to develop contextualized
genome-scale models, in which information about lack of function
(through either mutation or gene knockdown) can be used directly to
delete unrelated reactions. Methods for optimization of models are
available in COBRA [57] or RAVEN [58]. Both tools operate in MATLAB
or Python and provide a variety of optimization routines for the
development of contextualized models and optimization of metabolic
flux. Recon3D provides a complete known metabolic network [25, 57].
The COBRA platform allows for the addition of new reactions and
features.
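One common reading of the reaction deletion described in step 1 is that a reaction is removed when none of its enzymes can still be formed. The sketch below illustrates that idea in plain Python under stated assumptions: the rule format (a list of alternative gene sets per reaction), the function name, and all gene and reaction identifiers are hypothetical, and this is not the routine used by COBRA or RAVEN.

```python
def active_reactions(gpr_rules, inactive_genes):
    """Return the reaction IDs that remain functional.

    gpr_rules maps a reaction ID to a list of enzyme complexes. Each
    complex is a set of genes that must all be functional (AND), and any
    single intact complex is enough to catalyse the reaction (OR). An
    empty list means no gene association, so the reaction is always kept.
    """
    inactive = set(inactive_genes)
    kept = []
    for rxn_id, complexes in gpr_rules.items():
        # A complex is intact when it shares no gene with the inactive set.
        if not complexes or any(not (set(c) & inactive) for c in complexes):
            kept.append(rxn_id)
    return kept


# Hypothetical toy network; all names are illustrative only.
gpr = {
    "HEX1": [{"geneA"}, {"geneB"}],   # two isozymes
    "PFK": [{"geneC", "geneD"}],      # one complex with two subunits
    "EX_glc": [],                     # exchange reaction without a gene rule
}
remaining = active_reactions(gpr, inactive_genes={"geneA", "geneC"})
```

Here the reaction with an intact isozyme is kept, while the reaction whose only complex has lost a subunit is removed.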


2. For dynamic network reactions, thermodynamic information can be
obtained from existing databases (Table 1), ensuring that the kinetic
information is curated, up to date, and appropriate for the species
under investigation. The functional form of V(E, c, k) can be
approximated using the Michaelis-Menten equation or other, more
detailed formalisms and can include inhibition and activation
interactions. It is critical to ensure that the kinetic constants
used match the model type and the units of the metabolomic data.


3. FBA must be optimized for desired properties. This can be achieved
by maximizing, for example, biomass production or cell growth using
COBRA [57]. For dynamic models, kinetic parameters can be optimized
from available data for the system. Optimization can be done using
numerical methods or ML methods (e.g., LSTM; see Method B).


4. Experimentally validate the FBA model by comparing predicted
individual metabolite levels with matched pairs of metabolites
measured in the metabolomic screen.


5. If stochastic aspects are significant for the simulation, include
randomness, for example, by using the chemical Langevin formulation
or a Poisson mixture model (PMM) as recently presented [59].


3.4 Integrate Mechanistic Model of Metabolic Processes with ML


1. Integrate in silico fluxomic and other omics data: Data
integration can be performed in three ways: (a) early integration, in
which data are concatenated into a single dataset; (b) intermediate
integration, in which the ML model is built using a combined
transformation of the separate input sets; and (c) late integration,
in which a separate model is built for each dataset and the models
are fused. Following integration, all data should be scaled, for
example, by z-score scaling (see 2c). In the cross-validation
process, training data should be normalized, and the same
normalization parameters should be used for the test set. In the case
of z-score normalization, the training set is normalized, and the
mean and standard deviation of the training set are used to normalize
the test set in order to prevent information leakage.


2. Develop ML architecture that allows analysis of integrated data: A
variety of methods are available and can be explored, with the method
proposed below resulting from [60]. Approaches for fusing
experimental results with knowledge-based in silico models through
interpretable ML are reviewed in [33].


3.5 Examples of Methods


1. Combination of data: Data-independent ensemble ML can be used to
combine all data (using the late integration approach; see above),
including omics data as well as the predicted metabolic data, each
run by individual base learners. Subsequently, the predictions and
prediction probabilities of the base learners are combined in the
meta-learner output, with a weight for each predictor. The final
probability of the result is $p = \sum_i p_i w_i$, where $i$ is the
base learner with probability of prediction $p_i$ and weight $w_i$.
Alternatively, fluxomic data can be combined with other omics data
and analyzed together using ML (with early or intermediate data
integration). The multimodal artificial neural network (MANN) method
has shown the best performance for combined analysis of fluxomic and
transcriptomic data [33]; however, different combinations and sizes
of data require optimization of ML methods for any given application.
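The weighted combination of base-learner probabilities described in this step can be sketched as follows. The function name is hypothetical, and normalizing the weights so that they sum to 1 is an assumption made here to keep the result a probability; the source formula does not state it.

```python
def combine_base_learners(probabilities, weights):
    """Weighted sum p = sum_i p_i * w_i over the base learners."""
    if len(probabilities) != len(weights):
        raise ValueError("one weight per base learner is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Assumption: weights are rescaled to sum to 1 so that p stays in [0, 1].
    return sum(p * w / total for p, w in zip(probabilities, weights))


# Hypothetical base learners, e.g. one per omics dataset plus predicted fluxes.
p_final = combine_base_learners([0.9, 0.6, 0.2], [0.5, 0.3, 0.2])
```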


2. Optimization of hyperparameters for the model: The gradient
boosting machine (GBM) algorithm can be used with Bayesian
optimization to determine optimal hyperparameter values. Bayesian
optimization is run in multiple iterations, with fivefold cross
validation used to determine the performance of the selected
hyperparameters. The weighted log loss is calculated as the
performance metric for the GBM and is also used to determine model
performance on the validation sets. The formula for the weighted log
loss is:

$$\frac{1}{N_s} \sum_{i=1}^{N_s} \left[ -\left( w_g\, y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right) \right] \qquad (2)$$

with $y_i$ the true class label of sample $i$, $p_i$ the predicted
probability of sample $i$ having the predicted label, $w_g$ the
weight for the given label, and $N_s$ the total number of samples.
Overfitting can be prevented by early stopping of the optimization
process. The mean weighted log loss with one standard error over all
five folds of cross validation is used to determine the performance
of the best hyperparameter set.
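A minimal sketch of the weighted log loss of Eq. 2 and of the mean with one standard error over cross-validation folds is given below. The function names, the clipping constant, and the example values are assumptions introduced for illustration; they are not part of the cited method.

```python
import math
import statistics


def weighted_log_loss(y_true, p_pred, w_g=1.0, eps=1e-15):
    """Weighted log loss as in Eq. 2; w_g scales the positive-class term."""
    if not y_true or len(y_true) != len(p_pred):
        raise ValueError("y_true and p_pred must be non-empty and equal in length")
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clipping avoids log(0); the value of eps is an assumption.
        p = min(max(p, eps), 1.0 - eps)
        total += -(w_g * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def mean_and_standard_error(fold_losses):
    """Mean and standard error of the per-fold losses (sample SD / sqrt(n))."""
    mean = statistics.mean(fold_losses)
    se = statistics.stdev(fold_losses) / math.sqrt(len(fold_losses))
    return mean, se


loss = weighted_log_loss([1, 0], [0.5, 0.5], w_g=2.0)
# Hypothetical losses from five folds.
fold_mean, fold_se = mean_and_standard_error([0.40, 0.50, 0.45, 0.55, 0.60])
```

With `w_g=1.0` the function reduces to the ordinary binary log loss.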


3. Test quality of ML model using cross validation: Data are split
into training, validation, and testing datasets. The training set,
usually a randomly selected 80% of the complete dataset, is used for
training the model with a user-defined set of hyperparameters. The
validation part of the data (usually the remaining 20%) is used to
assess model performance for the set of hyperparameters optimized
using the training set.


4. Test classifier performance for multiple iterations of randomized
training/validation and testing data split: Preferred performance
metrics are weighted log loss (Eq. 2), area under the receiver
operating characteristic curve (AUROC), and measures comparing true
positives (TP), false positives (FP), true negatives (TN), and false
negatives (FN), including:

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (3)$$

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (4)$$

$$\text{Balanced Accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right) \qquad (5)$$
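Eqs. 3-5 translate directly into code. The sketch below assumes binary labels coded as 0 and 1; the function names and the example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, and FN for binary labels coded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Eq. 3; undefined (division by zero) when there are no positives."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Eq. 4; undefined (division by zero) when there are no negatives."""
    return tn / (tn + fp)


def balanced_accuracy(tp, fp, tn, fn):
    """Eq. 5: the mean of sensitivity and specificity."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp))


# Hypothetical validation labels and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
```

Balanced accuracy is less sensitive to class imbalance than plain accuracy because each class contributes equally.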


5. Determine the importance of features in predictive models or
classification: Feature selection for both individual groups of
samples and across combined samples can be done by calculating
SHapley Additive exPlanations (SHAP) values for each classifier [61].


3.6 Determination of Parameters for Mechanistic Models from Data- or Theory-Driven ML


1. Develop a kinetic or constrained metabolic model as described in
3.3.


2. Generate combinations of input parameters randomly; if information
is available, constrain parameter values within the allowed range.
Parameters can include, for example, kinetic constants, cell growth
rate, cell motility, and media metabolite concentrations. Model
output values can include metabolite concentration change over time,
biomass information, and cell density as calculated by the metabolic
model.
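Steps 1 and 2 can be sketched with a toy mechanistic model. The example assumes a single irreversible Michaelis-Menten reaction integrated with an explicit Euler scheme; the parameter bounds, step size, and function names are hypothetical and stand in for whatever kinetic or constrained model is actually used.

```python
import random


def sample_parameters(ranges, rng):
    """Draw one random parameter combination within the allowed ranges."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


def simulate_substrate(vmax, km, s0, dt=0.01, n_steps=1000):
    """Explicit Euler integration of ds/dt = -vmax * s / (km + s)."""
    s = s0
    trajectory = [s]
    for _ in range(n_steps):
        s = max(s - dt * vmax * s / (km + s), 0.0)  # concentrations stay non-negative
        trajectory.append(s)
    return trajectory


rng = random.Random(42)                           # fixed seed for reproducibility
ranges = {"vmax": (0.5, 2.0), "km": (0.1, 1.0)}   # hypothetical bounds
training_pairs = []
for _ in range(5):
    params = sample_parameters(ranges, rng)
    output = simulate_substrate(params["vmax"], params["km"], s0=1.0)
    training_pairs.append((params, output))       # parameters in, time course out
```

Each pair couples one sampled parameter set with the simulated concentration time course, which is the form of data later used to train the ML model.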


3. Develop an LSTM architecture with an input layer, a fully
connected layer, LSTM arrays, and two output layers, one for
predicting peak values of distributions and one for predicting the
normalized distributions. A vanilla LSTM is available in MATLAB and
Python (TensorFlow or PyTorch). In the application of GPR with prior
information, architecture development requires selection or
generation of appropriate kernel functions, with the possibility of
additive kernel functions.


4. Perform input and output data preprocessing, including data
scaling, for example, with min-max scaling to bring all data into the
0-1 range or with z-score normalization.
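Both scaling options in step 4 can be sketched with the standard library. In line with the earlier advice on information leakage, the scaling parameters are fitted on the training data only and then reused for the test data. The function names are hypothetical, and using the sample standard deviation is an assumption.

```python
import statistics


def fit_min_max(values):
    """Return the minimum and maximum of the training values."""
    return min(values), max(values)


def apply_min_max(values, low, high):
    """Scale values with training-set bounds; unseen data may fall outside 0-1."""
    span = high - low
    return [(v - low) / span if span else 0.0 for v in values]


def fit_z_score(values):
    """Return the training mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def apply_z_score(values, mean, sd):
    """Standardize values with training-set mean and standard deviation."""
    return [(v - mean) / sd if sd else 0.0 for v in values]


train, test = [2.0, 4.0, 6.0], [8.0]   # hypothetical feature values
low, high = fit_min_max(train)
mean, sd = fit_z_score(train)
train_mm, test_mm = apply_min_max(train, low, high), apply_min_max(test, low, high)
train_z, test_z = apply_z_score(train, mean, sd), apply_z_score(test, mean, sd)
```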


5. Use the calculated molecular value distribution obtained in 3.6.2
with a random combination of parameters to train ML models.


6. In the application of LSTM, parameters are used as input and
molecular values as output of the neural network model. Randomly
divide the data into training and test sets for cross-validation
assessment of model accuracy, or use leave-one-out cross validation.


7. In LSTM, model input parameters are connected first to all neurons
in the fully connected layer. Select the activation function (e.g.,
exponential linear unit), and initialize connection weights randomly.


8. Optimize the network using, for example, cross entropy, and
calculate the cost function of the neural network using mean squared
error.


9. Evaluate the model using the test set with, for example,
calculation of the root mean square error (RMSE) to determine the
difference between LSTM and mechanistic model results.


10. For prediction of new values, use the developed LSTM with new
parameter inputs, and for enhanced accuracy, use the ensemble
approach, for example, with Wisdom of the Crowd analysis. In this
approach, calculations are rerun with the same input, and similarity
scores are calculated between the different predictions using RMSE,
R², or some other similarity assessment function. Each prediction is
assigned an assessment score relative to the average prediction, and
the result with the minimal score, i.e., minimal deviation from the
average, is selected as the final prediction result.
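The RMSE evaluation of step 9 and the selection rule of step 10 can be sketched together. The example scores each repeated prediction by its RMSE from the element-wise average prediction and keeps the one with the smallest score. The function names and the example runs are hypothetical, and RMSE is only one of the similarity functions the step allows.

```python
import math


def rmse(a, b):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def select_consensus_prediction(predictions):
    """Return the index of the run closest to the average prediction, and all scores."""
    n = len(predictions)
    length = len(predictions[0])
    average = [sum(p[j] for p in predictions) / n for j in range(length)]
    scores = [rmse(p, average) for p in predictions]
    best = min(range(n), key=scores.__getitem__)
    return best, scores


# Hypothetical outputs of three reruns with the same input.
runs = [[1.0, 2.0, 3.0], [1.1, 2.1, 3.1], [2.0, 3.0, 4.0]]
best_index, scores = select_consensus_prediction(runs)
```

The same `rmse` function can be used to compare LSTM outputs with the mechanistic model results on the test set.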


Acknowledgments 


Work was supported in part by operating grants AI-4D-102-3 to 
SALB and MCC from the National Research Council AI for Design 
Challenge Program, RGPIN-2019-06796 to SALB from the Nat- 
ural Sciences and Engineering Research Council of Canada 
(NSERC), as well as an NSERC CREATE Matrix Metabolomics 
Training grant to SALB. TTN received an NSERC CREATE 
Matrix Metabolomics Graduate Scholarship. 


References 


1. Richelle A, David B, Demaegd D et al (2020) Towards a widespread adoption of metabolic modeling tools in biopharmaceutical industry: a process systems biology engineering perspective. NPJ Syst Biol Appl 6(1):6
2. Puniya BL, Amin R, Lichter B et al (2021) Integrative computational approach identifies drug targets in CD4(+) T-cell-mediated immune disorders. NPJ Syst Biol Appl 7(1):4
3. Blais EM, Rawls KD, Dougherty BV et al (2017) Reconciled rat and human metabolic networks for comparative toxicogenomics and biomarker predictions. Nat Commun 8:14250
4. Bordbar A, Jamshidi N, Palsson BO (2011) iAB-RBC-283: a proteomically derived knowledge-base of erythrocyte metabolism that can be used to simulate its physiological and patho-physiological states. BMC Syst Biol 5:110
5. Thomas A, Rahmanian S, Bordbar A et al (2014) Network reconstruction of platelet metabolism identifies metabolic signature for aspirin resistance. Sci Rep 4:3925
6. Rico J, Nantel A, Pham PL et al (2018) Kinetic model of metabolism of monoclonal antibody producing CHO cells. Current Metabolomics 6
7. Nguyen TNT, Sha S, Hong MS et al (2021) Mechanistic model for production of recombinant adeno-associated virus via triple transfection of HEK293 cells. Mol Ther Methods Clin Dev 21:642-655
8. Chandrasekaran S, Zhang J, Sun Z et al (2017) Comprehensive mapping of pluripotent stem cell metabolism using dynamic genome-scale network modeling. Cell Rep 21(10):2965-2977
9. Cuperlovic-Culf M (2018) Machine learning methods for analysis of metabolic data and metabolic pathway modeling. Meta 8(1)
10. Srinivasan S, Cluett WR, Mahadevan R (2015) Constructing kinetic models of metabolism at genome-scales: a review. Biotechnol J 10(9):1345-1359
11. Helmy M, Smith D, Selvarajoo K (2020) Systems biology approaches integrated with artificial intelligence for optimized metabolic engineering. Metab Eng Commun 11:e00149
12. Borzi A (2020) Modelling with ordinary differential equations: a comprehensive approach, 1st edn. Chapman and Hall/CRC
13. von Stosch M, Peres J, de Azevedo SF et al (2010) Modelling biochemical networks with intrinsic time delays: a hybrid semi-parametric approach. BMC Syst Biol 4:131
14. Srinivasan B (2021) A guide to the Michaelis-Menten equation: steady state and beyond. FEBS J
15. Wittig U, Kania R, Golebiewski M et al (2012) SABIO-RK--database for biochemical reaction kinetics. Nucleic Acids Res 40(Database issue):D790-D796
16. Chang A, Jeske L, Ulbrich S et al (2021) BRENDA, the ELIXIR core data resource in 2021: new developments and updates. Nucleic Acids Res 49(D1):D498-D508
17. Saa PA, Nielsen LK (2016) Construction of feasible and accurate kinetic models of metabolism: a Bayesian approach. Sci Rep 6:29635
18. Orth JD, Thiele I, Palsson BO (2010) What is flux balance analysis? Nat Biotechnol 28(3):245-248
19. Jerby L, Shlomi T, Ruppin E (2010) Computational reconstruction of tissue-specific metabolic models: application to human liver metabolism. Mol Syst Biol 6:401
20. Zhang C, Bidkhori G, Benfeitas R et al (2018) ESS: a tool for genome-scale quantification of essentiality score for reaction/genes in constraint-based modeling. Front Physiol 9:1355
21. Lewis NE, Nagarajan H, Palsson BO (2012) Constraining the metabolic genotype-phenotype relationship using a phylogeny of in silico methods. Nat Rev Microbiol 10(4):291-305
22. Richelle A, Joshi C, Lewis NE (2019) Assessing key decisions for transcriptomic data integration in biochemical networks. PLoS Comput Biol 15(7):e1007185
23. Opdam S, Richelle A, Kellman B et al (2017) A systematic evaluation of methods for tailoring genome-scale metabolic models. Cell Syst 4(3):318-329.e316
24. Aurich MK, Fleming RM, Thiele I (2016) MetaboTools: a comprehensive toolbox for analysis of genome-scale metabolic models. Front Physiol 7:327
25. Brunk E, Sahoo S, Zielinski DC et al (2018) Recon3D enables a three-dimensional view of gene variation in human metabolism. Nat Biotechnol 36(3):272-281
26. Fahy E, Subramaniam S, Murphy RC et al (2009) Update of the LIPID MAPS comprehensive classification system for lipids. J Lipid Res 50(Suppl):S9-S14
27. Shevchenko A, Simons K (2010) Lipidomics: coming to grips with lipid diversity. Nat Rev Mol Cell Biol 11(8):593-598
28. Bennett SAL, Valenzuela N, Xu H et al (2013) Using neurolipidomics to identify phospholipid mediators of synaptic (dys)function in Alzheimer's disease. Front Physiol 4:168
29. Mao C, Obeid LM (2008) Ceramidases: regulators of cellular responses mediated by ceramide, sphingosine, and sphingosine-1-phosphate. Biochim Biophys Acta 1781(9):424-434
30. Teichgraber V, Ulrich M, Endlich N et al (2008) Ceramide accumulation mediates inflammation, cell death and infection susceptibility in cystic fibrosis. Nat Med 14(4):382-391
31. Bastanlar Y, Ozuysal M (2014) Introduction to machine learning. Methods Mol Biol 1107:105-128
32. Costello Z, Martin HG (2018) A machine learning approach to predict metabolic pathway dynamics from time-series multiomics data. NPJ Syst Biol Appl 4:19
33. Culley C, Vijayakumar S, Zampieri G et al (2020) A mechanism-aware and multiomic machine-learning pipeline characterizes yeast cell growth. Proc Natl Acad Sci U S A 117(31):18869-18879
34. Mc Auley MT, Mooney KM (2015) Computationally modeling lipid metabolism and aging: a mini-review. Comput Struct Biotechnol J 13:38-46
35. Shaked I, Oberhardt MA, Atias N et al (2016) Metabolic network prediction of drug side effects. Cell Syst 2(3):209-213
36. Folch-Fortuny A, Teusink B, Hoefsloot HCJ et al (2018) Dynamic elementary mode modelling of non-steady state flux data. BMC Syst Biol 12(1):71
37. Metzcar J, Wang Y, Heiland R et al (2019) A review of cell-based computational modeling in cancer biology. JCO Clin Cancer Inform 3:1-13
38. Van Houdt G, Mosquera C, Napoles G (2020) A review on the long short-term memory model. Artif Intell Rev 53:5929-5955
39. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735-1780
40. Cheng L, Ramchandran S, Vatanen T et al (2019) An additive Gaussian process regression model for interpretable non-parametric analysis of longitudinal data. Nat Commun 10(1):1798
41. Mass spectrometry-based lipidomics approaches (2016) In: Hsu F-F (ed) Lipidomics. pp 53-88
42. Lipidomics (2017) Springer Protocols
43. Chitpin JG, Surendra A, Nguyen TT et al (2021) BATL: Bayesian annotations for targeted lipidomics. Bioinformatics. in press
44. Tsugawa H, Arita M, Kanazawa M et al (2013) MRMPROBS: a data assessment and metabolite identification tool for large-scale multiple reaction monitoring based widely targeted metabolomics. Anal Chem 85(10):5191-5199
45. Domingo-Almenara X, Montenegro-Burke JR, Ivanisevic J et al (2018) XCMS-MRM and METLIN-MRM: a cloud library and public resource for targeted analysis of small molecules. Nat Methods 15(9):681-684
46. Niu W, Knight E, Xia Q et al (2014) Comparative evaluation of eight software programs for alignment of gas chromatography-mass spectrometry chromatograms in metabolomics experiments. J Chromatogr A 1374:199-206
47. Wang Y, Ma L, Zhang M et al (2019) A simple method for peak alignment using relative retention time related to an inherent peak in liquid chromatography-mass spectrometry-based metabolomics. J Chromatogr Sci 57(1):9-16
48. Lin CY, Wu H, Tjeerdema RS et al (2007) Evaluation of metabolite extraction strategies from tissue samples using NMR metabolomics. Metabolomics 3(1):55-67
49. Wishart DS, Jewison T, Guo AC et al (2013) HMDB 3.0--the human metabolome database in 2013. Nucleic Acids Res 41(Database issue):D801-D807
50. Velankar S, Burley SK, Kurisu G et al (2021) The protein data bank archive. Methods Mol Biol 2305:3-21
51. Romero PR, Kobayashi N, Wedell JR et al (2020) BioMagResBank (BMRB) as a resource for structural biology. Methods Mol Biol 2112:187-218
52. Ravanbakhsh S, Liu P, Bjorndahl TC et al (2015) Accurate, fully-automated NMR spectral profiling for metabolomics. PLoS One 10(5):e0124219
53. Wang RCC, Campbell DA, Green JR et al (2021) Automatic 1D (1)H NMR metabolite quantification for bioreactor monitoring. Meta 11(3)
54. Jager S, Allhorn A, Biessmann F (2021) A benchmark for data imputation methods. Front Big Data 4:693674
55. Jauhiainen A, Madhu B, Narita M et al (2014) Normalization of metabolomics data with applications to correlation maps. Bioinformatics 30(15):2155-2161
56. Walach J, Filzmoser P, Hron K (2018) Data normalization and scaling: consequences for the analysis in omics sciences. Compr Anal Chem 82:165-196
57. Heirendt L, Arreckx S, Pfau T et al (2019) Creation and analysis of biochemical constraint-based models using the COBRA toolbox v.3.0. Nat Protoc 14(3):639-702
58. Wang H, Marcisauskas S, Sanchez BJ, Domenzain I, Hermansson D, Agren R, Nielsen J, Kerkhoven EJ (2018) RAVEN 2.0: a versatile toolbox for metabolic network reconstruction and a case study on Streptomyces coelicolor. PLoS Comput Biol 14(10):e1006541
59. Cornish-Bowden A (2014) Fundamentals of enzyme kinetics. Elsevier
60. Lewis JE, Kemp ML (2021) Integration of machine learning and genome-scale metabolic modeling identifies multi-omics biomarkers for radiation resistance. Nat Commun 12(1):2700
61. Guyon I (2017) Advances in neural information processing system 30 pre-proceedings. NeurIPS 2017
62. Blattmann P, Henriques D, Zimmermann M et al (2017) Systems pharmacology dissection of cholesterol regulation reveals determinants of large pharmacodynamic variability between cell lines. Cell Syst 5(6):604-619.e607
63. Sahle S, Gauges R, Pahle J et al (2006) Simulation of biochemical networks using Copasi: a complex pathway simulator. In: Proceedings of the 2006 Winter Simulation Conference
64. Matsuoka Y, Funahashi A, Ghosh S et al (2014) Modeling and simulation using CellDesigner. Methods Mol Biol 1164:121-145
65. Resasco DC, Gao F, Morgan F et al (2012) Virtual cell: computational tools for modeling in cell biology. Wiley Interdiscip Rev Syst Biol Med 4(2):129-140
66. Bergmann FT, Hoops S, Klahn B et al (2017) COPASI and its applications in biotechnology. J Biotechnol 261:215-220
67. Martinez JA, Bulte DB, Contreras MA et al (2020) Dynamic modeling of CHO cell metabolism using the hybrid cybernetic approach with a novel elementary mode analysis strategy. Front Bioeng Biotechnol 8:279
68. Sanft KR, Wu S, Roh M et al (2011) StochKit2: software for discrete stochastic simulation of biochemical systems with events. Bioinformatics 27(17):2457-2458
69. Tonn MK, Thomas P, Barahona M et al (2019) Stochastic modelling reveals mechanisms of metabolic heterogeneity. Commun Biol 2:108
70. Ebrahim A, Lerman JA, Palsson BO et al (2013) COBRApy: COnstraints-based reconstruction and analysis for python. BMC Syst Biol 7:74
71. Dias O, Rocha M, Ferreira EC et al (2015) Reconstructing genome-scale metabolic models with merlin. Nucleic Acids Res 43(8):3899-3910
72. Gutierrez JM, Feizi A, Li S et al (2020) Genome-scale reconstructions of the mammalian secretory pathway predict metabolic costs and limitations of protein secretion. Nat Commun 11(1):68
73. Yu JS, Bagheri N (2020) Agent-based models predict emergent behavior of heterogeneous cell populations in dynamic microenvironments. Front Bioeng Biotechnol 8:249
74. Malik-Sheriff RS, Glont M, Nguyen TVN et al (2020) BioModels-15 years of sharing computational models in life science. Nucleic Acids Res 48(D1):D407-D415
75. Wittig U, Rey M, Weidemann A et al (2018) SABIO-RK: an updated resource for manually curated biochemical reaction kinetics. Nucleic Acids Res 46(D1):D656-D660
76. Flamholz A, Noor E, Bar-Even A et al (2012) eQuilibrator--the biochemical thermodynamics calculator. Nucleic Acids Res 40(Database issue):D770-D775




A Machine Learning-Based Approach Using Multi-omics 
Data to Predict Metabolic Pathways 


Vidya Niranjan, Akshay Uttarkar, Aakaanksha Kaul, 
and Maryanne Varghese 


Abstract 


Integrative methods are continuously evolving to provide accurate insights from data
obtained through experimentation on various biological systems. Multi-omics data can be integrated
with predictive machine learning algorithms to provide results with high accuracy. This protocol
chapter defines the steps required for the ML-multi-omics integration methods that are applied to
biological datasets for their analysis and for the visual interpretation of the results thus obtained.


Key words Multi-omics, Machine learning, Integration, Algorithms, Unsupervised learning, 
Supervised learning 


1 Introduction 


In response to the vast amounts of omics data generated from high- 
throughput technologies, many integrated approaches have been 
sought out to aid in their analyses and visualization. Despite the fact 
that omics studies like metabolomics, lipidomics, and glycomics are 
not included in the core dogma analysis [1], they nonetheless 
provide a wealth of information about metabolites, lipids, and 
glycans (synthesized by the proteome via biosynthetic pathways) 
[2]. For example, because of its high sensitivity, high throughput, 
and unbiasedness, nontargeted metabolomics has attracted wide- 
spread attention as a method of profiling endogenous metabolites. 
As a result, metabolomics approaches have increasingly been 
applied to a variety of areas, including medication evaluation and 
monitoring [3]. Metabolomics techniques use cutting-edge analytical
technologies to thoroughly analyze the metabolite composition of
biological materials. In recent years, the liquid chromatography-mass
spectrometry technological platform has become the most widely
utilized analytical tool for metabolomics research [4]. Using tools
like MetScape [5] and Mummichog [6], metabolites can be mapped to
pathways. The larger question, however, is whether other "omics"
data, for example, proteomics data, can be integrated with
metabolomics to achieve better accuracy in mapping results. For
example, the integration of RNA-Seq and ChIP-Seq analyses of data
obtained from cell lines of head and neck squamous cell carcinoma
(HNSCC) identified the association between the cancer-specific
histone marks H3K4me3 and H3K27ac and the transcriptional changes
observed in the driver genes of HNSCC: epidermal growth factor
receptor (EGFR), FGFR1, and FOXA1 [7]. Hence, multi-omics has been
shown to add significant value over results obtained from single-omics
data. The use of a multi-omics approach has led to the creation of a
variety of tools, methods, and platforms for multi-omics data
analysis.

Machine learning (ML) methods in high-throughput multi-omics analyses
have been gaining popularity in the recent decade due to the
increased accessibility of high computing power. ML can be used to
interpret and visualize the data obtained in the present and to
create an algorithm that can predict the results of datasets that may
be studied in the future. It is used where there are challenges in
designing definitive algorithms for a particular problem set. The use
of deep learning, a subclass of machine learning, has gained momentum
in cases where complex operations are required to read the dataset
and formulate predictions (see Notes 1 and 2).

A number of recent multi-omics studies highlight the superiority of
integration approaches in grasping the complexity of diverse diseases
and identifying the underlying anomalies from substantially generated
multi-omics data, which is not always achievable with individual
omics analysis (see Note 3).


Aakaanksha Kaul and Maryanne Varghese contributed equally with all other contributors.

Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology,
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7_19,
© The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


2 Methods


The tools chosen for integrating a given multi-omics dataset with
machine learning depend on the type of data, its quantity, the
integration method, and the expected outcome. The data can be
concatenated at an early stage, integrated at a late stage, or
integrated as a transformation. With each of these options,
supervised or unsupervised machine learning methods can be used
accordingly. The types of datasets that can be used are summarized in
Fig. 1a, which covers the possible types of biological datasets in
any combination (Table 1).
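Early concatenation can be illustrated with a minimal sketch that joins the feature vectors of samples measured in every omics layer. The data layout (one dictionary per layer keyed by sample ID), the function name, and the sample values are assumptions made for illustration; samples missing from any layer are simply dropped here, which is only one way of handling partial overlap.

```python
def early_concatenation(omics_layers):
    """Join feature vectors of the samples that are present in every layer.

    omics_layers maps a layer name to {sample_id: [feature values]}.
    """
    shared = set.intersection(*(set(layer) for layer in omics_layers.values()))
    combined = {}
    for sample_id in sorted(shared):
        row = []
        for name in sorted(omics_layers):   # fixed layer order keeps columns aligned
            row.extend(omics_layers[name][sample_id])
        combined[sample_id] = row
    return combined


# Hypothetical two-layer dataset with partial sample overlap.
layers = {
    "metabolomics": {"s1": [0.2, 1.4], "s2": [0.9, 0.3], "s3": [0.5, 0.5]},
    "transcriptomics": {"s1": [5.0], "s2": [7.5]},
}
matrix = early_concatenation(layers)
```

Late integration would instead train one model per layer and fuse their outputs, so no joint matrix of this kind is built.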




[Fig. 1: flowchart image not reproduced. Labels recoverable from the image include "Hierarchical", "Majority based voting", "Graph based", "Kernel based", "iCluster", "MoCluster", "RVM", "MoroNet", and "CONEXIC".]


Fig. 1 A schematic representation of recommended algorithms for multi-omics data integration and analysis.
(a) Types of datasets that can be used for integration and analysis, in any combination. (b) The
scheme for unsupervised ML methods, in which the listed algorithms can be used depending on whether early
concatenation, late concatenation, or a transformation model is needed. (c) For datasets with supervised
learning, the scheme provides a list of recommended algorithms that can be used for data integration,
filtering, and clustering


Table 1
The list of tools recommended for predicting metabolic pathways from multi-omics data


Sl. no.  Genomics  Transcriptomics  Proteomics  Metabolomics  Tools to be used

1  Present  Present  Present  Autoencoders, elastic net, SVM, and consensus clustering

2  Present  Present  RF, PLS-DA, extra trees

3  Present  Present  Graphical RF

4  Present  Present  SVM




Supervised learning is a machine learning approach that uses labeled data
to train models [8]. It is used to address regression and classification
problems. Unsupervised learning is another machine learning approach that
uses unlabeled input data to discover patterns. It is used to tackle
association and clustering problems. A schematic representation of the
algorithms recommended for integration and data analysis with supervised
learning is shown in Fig. 1b. Similarly, if the available data call for
unsupervised learning, a schematic representation (Fig. 1c) recommends a
list of algorithms that can be used along with data filtering and
clustering.


2.1 Early Concatenation of Data


If the data can be concatenated at an early stage, the following
directions and tools apply to unsupervised and supervised methods
(see Notes 4 and 5).


Unsupervised:


1. Check whether the multi-omics datasets overlap. If the overlap is
partial, use MOFA (multi-omics factor analysis) [9].


2. If the overlap is complete, check whether the dataset is large after
integration. If yes, tools like moCluster [10], BN (Bayesian
network) [11], LRAcluster [12], iClusterBayes [13], and
even MOFA can be used.


3. If the dataset is not large after integration, check whether it has
negative values. If yes, use iCluster [14].


4. If all values are positive, check whether the datasets have different
distributions. If they do, use iCluster+ [15], JIVE (joint and
individual variation explained) [16], JBF (joint Bayes factor) [17],
or even the tools from points 1 and 2.


5. If the datasets have similar distributions, use joint NMF
(non-negative matrix factorization) [18].
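The five checks above form a small decision tree, which can be sketched in code. The function below is only an illustration: its name and its Boolean arguments are hypothetical, and the answers to the checks (overlap, size, sign, distribution) are assumed to have been worked out beforehand by the analyst.

```python
def suggest_unsupervised_early_tools(partial_overlap, large_after_integration,
                                     has_negative_values, different_distributions):
    """Walk the five checks in order and return the suggested tool names."""
    # Point 1: partially overlapping datasets
    if partial_overlap:
        return ["MOFA"]
    # Point 2: complete overlap and a large integrated dataset
    large_dataset_tools = ["moCluster", "Bayesian network", "LRAcluster",
                           "iClusterBayes", "MOFA"]
    if large_after_integration:
        return large_dataset_tools
    # Point 3: smaller dataset that contains negative values
    if has_negative_values:
        return ["iCluster"]
    # Point 4: all values positive, but the omics follow different distributions
    if different_distributions:
        return ["iCluster+", "JIVE", "JBF"] + large_dataset_tools
    # Point 5: all values positive and similarly distributed
    return ["joint NMF"]


suggestion = suggest_unsupervised_early_tools(
    partial_overlap=False, large_after_integration=False,
    has_negative_values=True, different_distributions=False)
```

Encoding the checks as ordered early returns mirrors the order in which the questions are asked in the list, so a later check is reached only when every earlier one has been answered "no".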


Supervised:


1. Check whether a large dataset is produced after integration. If yes,
either embedded methods like LASSO (least absolute shrinkage
and selection operator) [19] can be employed, or filter methods (like
information gain) or wrapper methods (like RFE, recursive feature
elimination) can be used. Filter and wrapper methods yield a reduced
dataset, which can be analyzed further with tools like DT (decision
tree) [20], NB (naive Bayes) [21], SVM (support vector
machine) [22], KNN (k-nearest neighbors) [23], K-Star [24],
BT (boosted trees) [25], SVR (support vector regression) [26],
ANN (artificial neural network) [27], and DNN (deep neural
network) [28] (see Notes 6 and 7).


2. If a smaller dataset is produced, the same tools that are used on
the reduced dataset in the previous point can be used.
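As an illustration of the filter route mentioned in point 1, the sketch below ranks the columns of an early-concatenated matrix by information gain and keeps the best ones. It is a minimal example under stated assumptions: the features are taken to be already discretized (here into "high" and "low"), and the feature names, labels, and the helper `top_k_features` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the label entropy left after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(columns, labels, k):
    """Rank named feature columns by information gain and keep the best k."""
    ranked = sorted(columns, key=lambda name: information_gain(columns[name], labels),
                    reverse=True)
    return ranked[:k]


# Hypothetical early-concatenated data: two transcript features and one metabolite
columns = {
    "rna_geneA": ["high", "high", "low", "low"],
    "rna_geneB": ["high", "low", "high", "low"],
    "met_lactate": ["high", "high", "low", "high"],
}
labels = ["case", "case", "control", "control"]
selected = top_k_features(columns, labels, k=2)
```

The reduced set of columns in `selected` is what would then be passed to one of the classifiers listed in point 1.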


2.2 Concatenation of Data at a Later Stage


If the data can be integrated at a later stage:

Unsupervised:


1. Check how many omics datasets there are to integrate. If the
number is greater than 2, tools like FCA (formal concept analysis)
consensus clustering [29], MDI (multiple dataset integration) [30],
BCC (Bayesian consensus clustering) [31], SNF (similarity network
fusion) [32], PINS (perturbation clustering for data integration and
disease subtyping), and PINS+ are used (see Note 8).


2. If the number is exactly 2, check whether one of the omics is
gene expression. If not, use the tools mentioned in point 1.


3. If yes, check whether the other omics is copy number variation
(CNV) data [33]. If yes, tools like PSDF (patient-specific
data fusion) [34], LemonTree [35], and CONEXIC [36]
are used. If not, LemonTree alone or the tools mentioned in
point 1 can be used.


Supervised:


Tools like majority-based voting [37], hierarchical classifiers
[38], ensemble-based methods (XGBoost [39] and KNN [40]), MOLI
(multi-omics late integration) [41], Hi-DFN Forest [42],
ATHENA (Analysis Tool for Heritable and Environmental Network
Associations) [43], and autoencoder-based classifiers can
be used.


2.3 Integration of Data as Transformation


If the dataset can be integrated as a transformation:
Unsupervised:


1. Check whether the multi-omics datasets overlap. If the overlap is
partial, NEMO (neighborhood-based multi-omics clustering) [44]
can be used.


2. If the overlap is complete, tools like rMKL-LPP (regularized
multiple kernel learning for locality preserving projections) [45],
PAMOGK (pathway graph kernel-based multi-omics clustering
approach) [46], Meta-SVM [47], and NEMO are used.


Supervised:


Check whether the transformation is kernel based or graph based. If it is
kernel based, tools like SDP-SVM [48], FSMKL (multiple kernel learning
for feature selection) [49], RVM (relevance vector machine) [50, 51],
AdaBoost RVM [52], and fMKL-DR [53] are used.


If it is graph based, tools like SSL (semi-supervised learning) [54-58],
graph sharpening [59, 60], composite network [61], Bayesian network [62],
and MORONET (Multi-Omics gRaph cOnvolutional NETworks) [63] are used.
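The idea behind the kernel-based route can be sketched as follows: each omics layer is first transformed into a sample-by-sample similarity (kernel) matrix, and the matrices are then combined into one. The example is a simplification and not the method of any tool cited above: it assumes an RBF kernel with an arbitrary `gamma`, uses fixed equal weights where multiple kernel learning tools would learn the weights, and the toy measurements are hypothetical.

```python
import math


def rbf_kernel(samples, gamma):
    """Pairwise RBF similarities exp(-gamma * ||x - y||^2) for one omics block."""
    n = len(samples)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
            kernel[i][j] = math.exp(-gamma * sq_dist)
    return kernel


def combine_kernels(kernels, weights=None):
    """Weighted sum of same-sized kernel matrices; equal weights by default."""
    if weights is None:
        weights = [1.0 / len(kernels)] * len(kernels)
    n = len(kernels[0])
    return [[sum(w * k[i][j] for w, k in zip(weights, kernels)) for j in range(n)]
            for i in range(n)]


# Hypothetical toy data: the same three samples measured in two omics layers
transcriptomics = [[1.0, 2.0], [1.1, 2.1], [5.0, 6.0]]
metabolomics = [[0.2], [0.3], [0.9]]
fused = combine_kernels([rbf_kernel(transcriptomics, gamma=0.5),
                         rbf_kernel(metabolomics, gamma=0.5)])
```

Because each layer is reduced to a matrix of the same size (samples by samples), layers with very different numbers of features can be combined on an equal footing. A fused matrix of this kind can be handed to any classifier that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`.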


3 Application


Tools for analysis via integration of metabolomics data with other types
of commonly available or generated data are consolidated in Table 1
(see Notes 9-13).


3.1 Supervised Learning


3.1.1 Test Case: Deep Learning-Based Multi-omics Integration [64]


Hepatocellular carcinoma (HCC) is the most common kind of liver
cancer (70-90%), and establishing robust survival subgroups would
improve patient care dramatically. Few studies take the high level of
heterogeneity into account and integrate multi-omics data to explicitly
predict HCC survival across patient cohorts. To close this gap, the study
used 15,629 genes from RNA-Seq, 365 miRNAs from miRNA-Seq, and 19,883
genes from DNA methylation data from The Cancer Genome Atlas (TCGA) as
input features for 360 samples to build a DL-based, survival-sensitive
model that predicts prognosis, as well as an alternative model that
considers both genomics and clinical data. The study employed a total of
six cohorts to reliably distinguish patient survival subpopulations; the
cohorts and the data processing are described below:


• The TCGA data were used in two steps: the first was to use
the entire TCGA dataset to obtain the labels of the survival risk
classes; the second was to train a support vector machine
(SVM) model by splitting the samples 60/40 between training
and held-out testing data.


• To assess the prediction accuracy of the DL-based prognosis
model, the study used five additional confirmation datasets.


• The multi-omics HCC data were obtained from the TCGA portal.


• A biological feature was removed if it had no value in more than
20% of the patients.


• A sample was removed if more than 20% of its features were
missing.


• Missing values were filled in with the impute function from the
R impute package [65].


• Input features with zero values across all samples were removed.
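The three removal rules in the list above can be sketched in a few lines. This is an illustration only and not the code used in the study: the function name is hypothetical, missing values are assumed to be encoded as `None`, and the order in which the rules are applied (features, then samples, then all-zero features) is an assumption. The imputation step, which the study performed with the R impute package, is not reproduced here.

```python
def filter_omics_matrix(matrix, feature_names, max_missing=0.2):
    """Filter a samples-by-features matrix in which None marks a missing value.

    Returns (kept_sample_indices, kept_feature_names, filtered_rows).
    """
    n_samples = len(matrix)
    # Rule 1: drop features that are missing in more than 20% of the patients
    keep_cols = [j for j in range(len(feature_names))
                 if sum(row[j] is None for row in matrix) / n_samples <= max_missing]
    if not keep_cols:
        return [], [], []
    # Rule 2: drop samples that miss more than 20% of the remaining features
    keep_rows = [i for i, row in enumerate(matrix)
                 if sum(row[j] is None for j in keep_cols) / len(keep_cols) <= max_missing]
    # Rule 3: drop features with no nonzero observed value in the kept samples
    keep_cols = [j for j in keep_cols
                 if any(matrix[i][j] not in (None, 0) for i in keep_rows)]
    filtered = [[matrix[i][j] for j in keep_cols] for i in keep_rows]
    return keep_rows, [feature_names[j] for j in keep_cols], filtered


# Hypothetical toy matrix: five patients, four features
names = ["g1", "g2", "g3", "g4"]
data = [
    [1.0, None, 0, 2.0],
    [2.0, None, 0, 1.0],
    [3.0, 5.0, 0, 4.0],
    [4.0, 6.0, 0, 3.0],
    [None, None, None, None],
]
kept_samples, kept_features, filtered = filter_omics_matrix(data, names)
```

In the toy matrix, `g2` is dropped for missingness, the last patient is dropped for having no usable values, and `g3` is dropped because it is zero everywhere. Any missing values that survive the filter would still need to be imputed before modeling.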


The three types of omics features were stacked together and used as input
to an autoencoder, a DL framework [66]. Each of the 100 features produced
by the autoencoder was subjected to univariate Cox proportional hazards
(Cox-PH) regression, and 37 of them were found to be significantly
associated with survival. K-means clustering was then used to group the
samples on these 37 features, with the cluster number K ranging from 2 to
6. K = 2 was established as the ideal number of classes for the ensuing
supervised machine learning operations.
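The last step, scanning K from 2 to 6 and keeping the best value, can be sketched as below. The text above does not state which criterion was used to rank the candidate values of K, so the mean silhouette width used here is an assumption, as are the function names and the toy two-dimensional points that stand in for the 37 survival-associated features.

```python
import math
import random


def kmeans(points, k, seed=0, iterations=50):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the previous center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def silhouette(points, assignment):
    """Mean silhouette width; higher means tighter, better separated clusters."""
    clusters = {}
    for p, a in zip(points, assignment):
        clusters.setdefault(a, []).append(p)
    if len(clusters) < 2:
        return -1.0
    scores = []
    for p, a in zip(points, assignment):
        own = clusters[a]
        if len(own) == 1:
            scores.append(0.0)
            continue
        within = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        nearest = min(sum(math.dist(p, q) for q in other) / len(other)
                      for label, other in clusters.items() if label != a)
        scores.append((nearest - within) / max(within, nearest))
    return sum(scores) / len(scores)


def choose_k(points, candidates=range(2, 7)):
    """Cluster once per candidate K and return the K with the best silhouette."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores


# Hypothetical toy data: two well separated groups of samples
group_a = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.3, 0.0], [0.2, 0.3]]
group_b = [[5.0, 5.0], [5.1, 5.2], [5.2, 5.1], [5.0, 5.3], [5.3, 5.0], [5.2, 5.3]]
best_k, scores = choose_k(group_a + group_b)
```

The cluster labels obtained with the selected K then serve as the class labels for the supervised step, in the same way that the survival risk classes were used to train the SVM model described above.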


3.2 Unsupervised Learning


3.2.1 Test Case: Multidata Integration of Metabolomics and Transcriptomics
to Reveal the Modulation Network of Cell Regulation [67]


Cells employ multiple levels of regulation, including transcriptional
and translational regulation, that drive core biological processes
and enable cells to respond to genetic and environmental changes.
Small-molecule metabolites are one category of critical cellular
intermediates that can both influence and be targets of cellular
regulation. Because metabolites represent the direct output of
protein-mediated cellular processes, endogenous metabolite
concentrations can closely reflect cellular physiological states,
especially when integrated with other molecular-profiling data. In this
case study, a network reconstruction approach simultaneously integrates
six different types of data (endogenous metabolite concentration, RNA
expression, DNA variation, DNA-protein binding, protein-metabolite
interaction, and protein-protein interaction data) to construct
probabilistic causal networks that elucidate the complexity of cell
regulation in a segregating yeast population.

Two classes of data were employed to reconstruct the probabilistic
causal networks: (1) DNA variation, gene expression, and metabolite
data measured in the BXR cross (referred to here as BXR data)
and (2) protein-DNA binding, protein-protein interaction, and
metabolite-protein interaction data available from public data
sources and generated independently of the BXR cross (referred
to here as non-BXR data). The BXR data are represented as nodes in
the network, and the edges reflect statistically inferred causal
relationships among the expression and metabolite traits.


4 Notes


1. The intricacy of multi-omics data processing necessitates
collaboration between the clinical and machine learning communities,
as well as the use of approaches from many fields. We found that some
promising methods, such as matrix factorization, have not been widely
used, whereas clustering and network-based approaches have been widely
used, owing to their flexibility and their ability to be integrated
into comprehensive frameworks that include feature extraction and
transformation to overcome the curse of dimensionality [68].


2. The impact of other types of noise and of filtering methods on omics
integration should be investigated in the future. Molecular
pathways [69], biomarkers [70], and sample categorization
[71] have all been discovered via multi-omics integration [15].


3. To reduce the complexity and heterogeneity of the datasets and to
facilitate their subsequent integration and analysis, most integration
algorithms created in recent years first transform each dataset
independently using machine learning models; this is known as mixed
integration [8].


4. Although early and intermediate integration strategies solve
this problem by integrating all datasets ahead of time, the
large matrix generated by early integration is hard for most
ML models to exploit, and intermediate integration often
depends on unsupervised matrix factorization, which has difficulty
incorporating the substantial amount of preexisting
biological knowledge [8].


5. If the model is not designed for a specific purpose or for specific
multi-omics datasets, it may perform poorly [72, 73]. Massive
matrices, outliers, highly correlated variables, noise, and other
difficulties are worsened in multi-omics investigations, and some
models cannot manage them [74]. It is also possible that the omics
are not integrated properly [9].


6. As some omics will contain little or no useful information, the
complementarity of the datasets and their relative pertinence
should be taken into account [10].


7. It is still challenging to translate DL model variable weights
into a context that domain specialists can understand [75]. Network
mapping [11], in which statistical, functional, and
ML-based outputs are transferred onto network manifolds
(similarity, biochemical, and empirical), needs to be
adapted for the layered DL feature space.


8. Identifying causation in complex phenotypes currently requires
specialized analysis and interpretation by domain experts;
however, in the future, medical data accessibility, quality, and
scale may enable near-automated DL-based detection of many
clinically relevant events [12].


9. Current machine learning methods based on empirical risk
minimization (ERM) have some limitations, such as difficulty in
identifying the causal relationships between variables [76]. When
minimizing the empirical error, the learning algorithm seeks to absorb
all of the association links (e.g., correlation) identified in the
data [13]. To solve the association-versus-cause conundrum, the
invariant risk minimization (IRM) theoretical framework was presented
for learning causation by inferring invariances across conditions
(e.g., different omics in a biological context) [14].


10. For machine learning applications, biological interpretability
remains a hurdle [77]. Previous work, for example interpretable
deep neural network modeling [16], has addressed this by
incorporating biological knowledge about the underlying mechanisms
into the machine learning model [78].


11. When working on rare events, such as an uncommon attribute
in a population, class imbalance develops when the distribution
of classes in the learning data is skewed, which can be a serious
problem [79]. This problem can be solved using a variety of
strategies [15], including sampling and cost-sensitive learning.


12. Missing data can take numerous forms, from missing values in
variables to missing omics data in samples. If enough samples
are available, listwise elimination of the rows with missing data
may be acceptable [10]. If not, a variety of statistical methods
can be employed to fill in the blanks [76].


13. A typical gene expression dataset may comprise tens of
thousands of variables, whereas a metabolomics dataset may have
only a few thousand. Such disparities between omics can make it
difficult to integrate them and can cause an imbalance
in the learning process. The various integration solutions
described each take a different approach to these issues, such
as lowering the number of variables, changing the input data
into a more usable representation, or integrating toward the end
of the study [76].


5 Summary


In recent years, a great deal of multi-omics research has been carried
out in cancer and plant studies. While the multi-omics approach promises
high impact, with applications in personalized healthcare in the near
future, the tools and approaches that use machine learning methods need
to become more user-friendly. A major challenge in the analysis of
multi-omics data is the heterogeneity of data generated from different
sources. The different scaling and normalization techniques used lead to
varied dynamic ranges of values, which in turn can change threshold
values and outliers. It is therefore preferable to analyze each omics
dataset on its own merits before devising a multi-omics approach.

This protocol chapter provides an overview of the various machine
learning methods and approaches used in multi-omics analysis.
Practitioners can choose an appropriate method based on the requirements
of their multi-omics dataset.


References

1. Cobb M (2017) 60 years ago, Francis Crick changed the logic of biology. PLoS Biol 15(9):e2003243-e2003243
2. Reel PS, Reel S, Pearson E et al (2021) Using machine learning approaches for multi-omics data analysis: a review. Biotechnol Adv 49:107739
3. Surowiec I, Karimpour M, Gouveia-Figueira S et al (2016) Multi-platform metabolomics assays for human lung lavage fluids in an air pollution exposure study. Anal Bioanal Chem 408(17):4751-4764
4. Wei Z, Xi J, Gao S et al (2018) Metabolomics coupled with pathway analysis characterizes metabolic changes in response to BDE-3 induced reproductive toxicity in mice. Sci Rep 8(1):5423-5423
5. Karnovsky A, Weymouth T, Hull T et al (2012) Metscape 2 bioinformatics tool for the analysis and visualization of metabolomics and gene expression data. Bioinformatics (Oxford, England) 28(3):373-380
6. Li S, Park Y, Duraisingham S et al (2013) Predicting network activity from high throughput metabolomics. PLoS Comput Biol 9(7):e1003123-e1003123
7. Chakraborty S, Hosen MI, Ahmed M et al (2018) Onco-multi-OMICS approach: a new frontier in cancer research. Biomed Res Int 2018:9836256-9836256
8. Sathya R, Abraham A (2013) Comparison of supervised and unsupervised learning algorithms for pattern classification. Int J Adv Res Artif Intell 2(2)
9. Argelaguet R, Velten B, Arnol D et al (2018) Multi-omics factor analysis-a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol 14(6):e8124-e8124
10. Meng C, Helm D, Frejno M et al (2015) moCluster: identifying joint patterns across multiple omics data sets. J Proteome Res 15(3):755-765
11. Fridley BL, Lund S, Jenkins GD et al (2012) A Bayesian integrative genomic model for pathway analysis of complex traits. Genet Epidemiol 36(4):352-359
12. Wu D, Wang D, Zhang MQ et al (2015) Fast dimension reduction and integrative clustering of multi-omics data using low-rank approximation: application to cancer molecular classification. BMC Genomics 16:1022-1022
13. Shen R, Olshen AB, Ladanyi M (2009) Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics (Oxford, England) 25(22):2906-2912
14. Raftopoulou P, Petrakis EGM iCluster: A self-organizing overlay network for P2P information retrieval. Lecture notes in computer science. Springer, Berlin/Heidelberg, pp 65-76
15. Subramanian I, Verma S, Kumar S et al (2020) Multi-omics data integration, interpretation, and its application. Bioinform Biol Insights 14:1177932219899051-1177932219899051
16. Lock EF, Hoadley KA, Marron JS et al (2013) Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann Appl Stat 7(1):523-542
17. Ray P, Zheng L, Lucas J et al (2014) Bayesian joint analysis of heterogeneous genomics data. Bioinformatics 30(10):1370-1376
18. Zhang S, Liu C-C, Li W et al (2012) Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res 40(19):9379-9391
19. Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418-1429
20. Salzberg SL (1994) C4.5: programs for machine learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993. Mach Learn 16(3):235-240
21. Domingos P, Pazzani M (1997) Mach Learn 29(2/3):103-130
22. Vapnik VN (2000) Direct methods in statistical learning theory. The nature of statistical learning theory. Springer, New York, pp 225-265
23. Altman NS (1992) An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat 46(3):175-185
24. Cleary JG, Trigg LE (1995) K*: an instance-based learner using an entropic distance measure. Machine learning proceedings 1995. Elsevier, pp 108-114
25. Elith J, Leathwick JR, Hastie T (2008) A working guide to boosted regression trees. J Anim Ecol 77(4):802-813
26. Awad M, Khanna R (2015) Efficient learning machines. Apress
27. Van Dyke Parunak H (1998) Book review: neural networks for pattern recognition by Christopher M. Bishop (Clarendon Press, 1995). ACM SIGART Bull 9(1):41-43
28. Tang B, Pan Z, Yin K et al (2019) Recent advances of deep learning in bioinformatics and computational biology. Front Genet 10:214-214
29. Hristoskova A, Boeva V, Tsiporkova E (2014) A formal concept analysis approach to consensus clustering of multi-experiment expression data. BMC Bioinform 15:151-151
30. Kirk P, Griffin JE, Savage RS et al (2012) Bayesian correlated clustering to integrate multiple datasets. Bioinformatics (Oxford, England) 28(24):3290-3297
31. Lock EF, Dunson DB (2013) Bayesian consensus clustering. Bioinformatics (Oxford, England) 29(20):2610-2616
32. Wang B, Mezlini AM, Demir F et al (2014) Similarity network fusion for aggregating data types on a genomic scale. Nat Methods 11(3):333-337
33. Freeman JL, Perry GH, Feuk L et al (2006) Copy number variation: new insights in genome diversity. Genome Res 16(8):949-961
34. Yuan Y, Savage RS, Markowetz F (2011) Patient-specific data fusion defines prognostic cancer subtypes. PLoS Comput Biol 7(10):e1002227-e1002227
35. Bonnet E, Calzone L, Michoel T (2015) Integrative multi-omics module network inference with lemon-tree. PLoS Comput Biol 11(2):e1003983-e1003983
36. Akavia UD, Litvin O, Kim J et al (2009) Abstract B70: conexic: a Bayesian framework to detect drivers and their function uncovers an endosomal signature in melanoma. Poster presentations - proffered abstracts, American Association for Cancer Research
37. Draghici S, Potter RB (2003) Predicting HIV drug resistance with neural networks. Bioinformatics 19(1):98-107
38. Bavafaye Haghighi E, Knudsen M, Elmedal Laursen B et al (2019) Hierarchical classification of cancers of unknown primary using multi-omics data. Cancer Informat 18:1176935119872163-1176935119872163
39. Ma A, McDermaid A, Xu J et al (2020) Integrative methods and practical challenges for single-cell multi-omics. Trends Biotechnol 38(9):1007-1022
40. Shen HB, Chou KC (2006) Ensemble classifier for protein fold pattern recognition. Bioinformatics 22(14):1717-1722
41. Sharifi-Noghabi H, Zolotareva O, Collins CC et al (2019) MOLI: multi-omics late integration with deep neural networks for drug response prediction. Bioinformatics (Oxford, England) 35(14):i501-i509
42. Xu J, Wu P, Chen Y et al (2019) A hierarchical integration deep flexible neural forest framework for cancer subtype classification by integrating multi-omics data. BMC Bioinformatics 20(1):527-527
43. Chung R-H, Kang C-Y (2019) A multi-omics data simulator for complex disease studies and its application to evaluate multi-omics data analysis methods for disease classification. GigaScience 8(5):giz045
44. Rappoport N, Shamir R (2019) NEMO: cancer subtyping by integration of partial multi-omic data. Bioinformatics (Oxford, England) 35(18):3348-3356
45. Speicher NK, Pfeifer N (2015) Integrating different data types by regularized unsupervised multiple kernel learning with application to cancer subtype discovery. Bioinformatics (Oxford, England) 31(12):i268-i275
46. Tepeli YI, Unal AB, Akdemir EM et al (2019) PAMOGK: a pathway graph kernel based multi-omics clustering approach for discovering cancer patient subgroups. Cold Spring Harbor Laboratory
47. Kim S, Jhong J-H, Lee J et al (2017) Meta-analytic support vector machine for integrating multiple omics data. BioData Mining 10:2-2
48. Lanckriet GRG, De Bie T, Cristianini N et al (2004) A statistical framework for genomic data fusion. Bioinformatics 20(16):2626-2635
49. Seoane JA, Day INM, Gaunt TR et al (2014) A pathway-based data integration framework for prediction of disease progression. Bioinformatics (Oxford, England) 30(6):838-845
50. Bowd C, Medeiros FA, Zhang Z et al (2005) Relevance vector machine and support vector machine classifier analysis of scanning laser polarimetry retinal nerve fiber layer measurements. Invest Ophthalmol Vis Sci 46(4):1322-1329
51. Zhou Y, Kantarcioglu M, Thuraisingham B (2012) Sparse Bayesian adversarial learning using relevance vector machine ensembles. 2012 IEEE 12th international conference on data mining. IEEE
52. Wu C-C, Asgharzadeh S, Triche TJ et al (2010) Prediction of human functional genetic networks from heterogeneous data using RVM-based ensemble learning. Bioinformatics (Oxford, England) 26(6):807-813
53. Giang T-T, Nguyen T-P, Tran D-H (2020) Stratifying patients using fast multiple kernel learning framework: case studies of Alzheimer's disease and cancers. BMC Med Inform Decis Mak 20(1):108-108
54. Tsuda K, Shin H, Schölkopf B (2005) Fast protein classification with multiple networks. Bioinformatics 21(Suppl 2):ii59-ii65
55. Culp M, Michailidis G (2008) Graph-based semisupervised learning. IEEE Trans Pattern Anal Mach Intell 30(1):174-179
56. Kim D, Joung J-G, Sohn K-A et al (2015) Knowledge boosting: a graph-based integration approach with multi-omics data and genomic knowledge for cancer clinical outcome prediction. J Am Med Inform Assoc: JAMIA 22(1):109-120
57. Bhardwaj A, Van Steen K (2020) Multi-omics data and analytics integration in ovarian cancer. IFIP Advances in Information and Communication Technology, Springer International Publishing, pp 347-357
58. Yue Z, Meng D, He J et al (2017) Semi-supervised learning through adaptive Laplacian graph trimming. Image Vis Comput 60:38-47
59. Shin H, Lisewski AM, Lichtarge O (2007) Graph sharpening plus graph integration: a synergy that improves protein functional classification. Bioinformatics 23(23):3217-3224
60. Shin H, Hill NJ, Lisewski AM et al (2010) Graph sharpening. Expert Syst Appl 37(12):7870-7879
61. Mostafavi S, Morris Q (2010) Fast integration of heterogeneous data sources for predicting gene function with limited annotation. Bioinformatics (Oxford, England) 26(14):1759-1765
62. Rhodes DR, Tomlins SA, Varambally S et al (2005) Probabilistic model of the human protein-protein interaction network. Nat Biotechnol 23(8):951-959
63. Wang T, Shao W, Huang Z et al (2020) MORONET: multi-omics integration via graph convolutional networks for biomedical data classification. Cold Spring Harbor Laboratory
64. Chaudhary K, Poirion OB, Lu L et al (2018) Deep learning-based multi-omics integration robustly predicts survival in liver cancer. Clin Cancer Res 24(6):1248-1259
65. Xiang Q, Dai X (2008) Improving missing value imputation in microarray data by using gene regulatory information. 2008 2nd international conference on bioinformatics and biomedical engineering. IEEE
66. Bengio Y (2009) Learning deep architectures for AI. Now Publishers Inc.
67. Zhu J, Sova P, Xu Q et al (2012) Stitching together multiple data dimensions reveals interacting metabolomic and transcriptomic networks that modulate cell regulation. PLoS Biol 10(4):e1001301-e1001301
68. Liu W, Ma S, Fenyö D (2017) Pathway-level integration of proteogenomic data in breast cancer using independent component analysis. Cold Spring Harbor Laboratory
69. Kaplan A, Lock EF (2017) Prediction with dimension reduction of multiple molecular data sources for patient survival. Cancer Informat 16:1176935117718517-1176935117718517
70. Grapov D, Wanichthanarak K, Fiehn O (2015) MetaMapR: pathway independent metabolomic network analysis incorporating unknowns. Bioinformatics (Oxford, England) 31(16):2757-2760
71. Grapov D, Fahrmann J, Wanichthanarak K et al (2018) Rise of deep learning for genomic, proteomic, and metabolomic data integration in precision medicine. Omics: J Integr Biol 22(10):630-636
72. Nguyen ND, Wang D (2020) Multiview learning for understanding functional multiomics. PLoS Comput Biol 16(4):e1007677-e1007677
73. Arjovsky M, Bottou L, Gulrajani I et al (2019) Invariant risk minimization. arXiv:1907.02893
74. Ma J, Yu MK, Fong S et al (2018) Using deep learning to model the hierarchical structure and function of a cell. Nat Methods 15(4):290-298
75. Tini G, Marchetti L, Priami C et al (2017) Multi-omics integration-a comparison of unsupervised clustering methodologies. Brief Bioinform 20(4):1269-1279
76. Picard M, Scott-Boyer M-P, Bodein A et al (2021) Integration strategies of multi-omics data for machine learning analysis. Comput Struct Biotechnol J 19:3735-3746
77. Nicora G, Vitali F, Dagliati A et al (2020) Integrated multi-omics analyses in oncology: a review of machine learning methods and tools. Front Oncol 10:1030-1030
78. Glass K, Huttenhower C, Quackenbush J et al (2013) Passing messages between biological networks to refine predicted interactions. PLoS One 8(5):e64832-e64832
79. Wahl S, Vogt S, Stückler F et al (2015) Multi-omic signature of body weight change: results from a population-based cohort study. BMC Med 13:48-48


INDEX 


A 

Alporithinis 2c ile. 22-24, 26, 29, 31, 
32 42, 60-65, 81, 83, 96, 98, 99, 102, 104, 106, 
111, 162-165, 179, 196, 201, 204, 212, 222, 
229, 271, 286, 288-293, 295-298, 300, 301, 
304-309, 311, 312, 315, 327, 328, 330-334, 
336, 351, 354-365, 369, 373-376, 379, 383, 
384, 386, 389, 390, 397-399, 401, 403, 405, 
406, 408, 419, 424, 430, 434, 442-444, 448 

Angiogenesis........... vi, 275-282 

Automatized design v....cccssssccsssiseresssacczssnsscrsesteoneweses 163 

B 

Bioinformatics .........cceeceeeeeececeeeeeeeeeeeeeeeees v, 4, 16, 79-92, 
223 224, 226, 230-232, 234, 248-255, 259, 
261, 265-273, 286, 301, 302, 315 

BlOPLOCESS oo. e eee ee eee eeeeeeeeeeeeee vi, 6, 33, 173-187 

BiOStatistics x. ccce; mais csrepsenyecsnsecvenssalioninr sa civenepeesvener teats 261 

Biotechnology: is.ssssissossssdscesecsedescrentvsvesecavsssaneisoss v, 7, 8, 11, 
17 29, 209, 222, 418, 422 

Boolean ‘gates ey 5 sauieensomens 121-153 

Breast CANCEL... .eecsecsccsecsseseesecesceseesceseeseeatene 270, 325-390 

C 

CaN GOP sy. ccscsg ters arsiorieestamenenenagieinns eet vi, 273, 277, 279, 
325-327 330, 342, 384, 390, 422, 446, 449 

Cell factory wees 2, 14, 18, 173, 174, 179 

Cell kinetic model .......cccccccccccesseseeeeeeeeee 174, 175, 184 

Cell Sorting sy, ceca cncevsestteniseenecenenes v, 42, 44, 48, 49, 445 

Compartmentalization 190, 191, 204, 424 

Computational fluid dynamics «00.0.0. eee vi, 174, 175 

Convergent promoters o.0..... cece eee vy, 121-153 

D 

Diatabases ..:....c.cccccessstsscesnsstesnessessevarssisserensess vi, 22, 27, 30, 


60-63 80, 85-87, 90, 92, 98, 100, 102, 118, 158, 
166-169, 210-213, 225, 231, 232, 235, 255, 
260, 265-273, 285, 286, 290-292, 294, 296, 
299, 302, 304, 313, 402, 421,422,427, 431,432 
Data integration 310, 398, 433, 434, 443, 445 
Deep learning... eee 23, 26, 27, 62, 72,79, 
285-315 390, 397, 398, 426, 442, 446-447 


Diffusion 0.0... ccc 
Digital circviit....c.iscssinissstsocessesaeeveortss 
Directed evolution . 


E 
Experimental design.......... 31, 42, 48, 49, 173-187, 199 
F 
Feature engimeering ..............scsssseeeseseeeeeeeees 314, 402-406 
G 
Generative MOdel oo... eeeteteeeeeeeeeees 96, 106-115 


Genetic stability 


Genome mining 


H 

High-throughput .....0... cece v, 2, 3,27, 223, 
266 272, 291, 292, 301, 305, 308-310, 314, 
326, 396, 428, 430, 441, 442 

History-dependent 0.0... eeeseseesseeeeeeeeeteeeeeeseees 157, 158, 
160-162 164-166, 169 

Hybrid modeling... eeeeeeeeeeeees 419, 420, 428 

I 

Improved light-induced dimerization
(iLID) system ........ 190, 192, 193,
195, 196, 198, 199, 201-206

Industrial fermentation ........ 4, 5, 12

Inference ........ 42, 46-50, 52,
176, 224, 306, 307, 426

Integrated modelling ........ vi, 174, 175, 179, 183-186

Integration ........ 12-16, 18, 23,
204, 265, 276, 295, 301, 304, 306, 310, 314,
315, 326, 335, 387, 390, 398, 424, 425, 433,
434, 442-449

Interaction ........ vi, 24, 28, 57, 58, 61,
63-64, 75, 92, 96, 106, 108, 131, 133, 149, 151,
177, 181, 182, 193, 194, 198, 199, 203, 205,
231, 251, 259, 267, 273, 285-315, 420, 429,
433, 447

Interpretability ........ 308, 312, 314,
328, 363, 364, 389, 449


Kumar Selvarajoo (ed.), Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology,
Methods in Molecular Biology, vol. 2553, https://doi.org/10.1007/978-1-0716-2617-7,
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2022


K 
KEGG ........ 30, 210-213,
267-272, 402
L 
Lipidomics ........ 423, 428-432, 441
Localization ........ 189, 190, 192,
211, 213, 226, 231, 251, 256, 299, 310, 314, 424
Logic ........ v, 2, 3, 5, 7-11, 13,
14, 16, 121-124, 126-128, 130-133, 136, 142,
150, 151, 155-169, 422
M 
Machine learning ........ v, vi, 4, 21-34,
42, 59, 273, 289-291, 301, 303, 305-309,
312-315, 325-390, 395-414, 417-436, 441-449
Markov random field ........ 98, 108
Massively parallel reporter assay
(MPRAs) ........ v, 41-54
Mathematical modeling ........ vi, 18, 31,
137, 142, 191, 213, 420
Metabolic engineering ........ v, vi, 1-18,
23, 28-33, 159, 209-218, 221, 261, 265-273
Metabolism modeling ........ 418, 420, 422,
424-427, 430, 431
Metabolomics ........ 222, 272, 326,
395-414, 422, 423, 428, 429, 431-433,
441-443, 446, 447, 449
Molecular assembly ........ 57-75
Multicellular consortia ........ 158
Multi-omics ........ 31, 222, 314,
326, 327, 441-449
Multiple sequence alignments (MSA) ........ 62, 63,
68, 71, 73, 82, 87, 90, 91, 96-100, 102-106, 108,
110, 112, 232, 255, 259
N 
Neural networks ........ 23-26, 28,
29, 32, 42, 83, 289, 290, 293, 300, 308-310, 312,
395-414, 425, 426, 435
O 
Omics ........ vi, 266, 272, 286, 303,
314, 326, 327, 387, 398, 420, 421, 424, 425,
427, 431-433, 441, 442, 445, 447-449
Optogenetics ........ vi, 190-192, 201, 204
Overlapping genes 
P 


Performance metrics ........ 397, 434
Pharmacogenomics ........ 266, 270
Plasma membrane ........ vi, 189-192, 199, 202
1,3-propanediol ........ 209-218


Protein design


Protein dynamics 


Protein engineering ........ v, 8, 24-29, 79-92
Protein networks ........ 271, 310
Protein-protein docking ........ 64-66, 70-74
Protein-protein interaction (PPI) ........ 57, 58,
63-64, 75, 226, 231, 234, 251, 254, 256, 259,
291, 292, 295-297, 302, 303, 306-308, 313, 447
Protein recruitment ........ 189-206
Protein structure prediction ........ 59-63, 65, 81-83, 85
R 
Recombinase ........ v, 155-169
RNA polymerase II collision ........ 123, 136,
140, 144-151
RNA sequencing (RNA-Seq) ........ 223-225,
232, 233, 235, 261, 266, 297, 298, 301, 310,
442, 446
S
Sequence coevolution ........ 60
Sequencing ........ 29, 42, 44-46, 48-50, 52, 79,
174, 222, 226, 266, 267, 272, 285, 301-303, 326
Signal transduction ........ 83, 190
Simulation ........ 22, 43-46, 48, 49,
51, 53, 54, 61, 69, 83, 113, 147, 149, 153, 175,
179, 183-186, 197, 202, 203, 205, 213, 218,
275-282, 418, 420, 425, 427, 433
Single-cell ........ 159, 160, 162, 163, 166, 169
Small molecule docking ........ 83, 88
Software ........ vi, 31, 43, 54, 65, 80,
82, 83, 85, 86, 91, 92, 98-99, 117, 153, 155, 158,
169, 210-213, 225, 248, 260, 265-273, 280,
288, 313, 336, 338, 405, 422, 427, 428
Speciality chemicals ........ 5, 8, 10, 11
Structural bioinformatics ........ 79-92
Supervised learning ........ 23, 306,
333, 397, 443, 444, 446-447
Survival analysis ........ 325-390


Synthetic biology ........ v, vi, 2, 5, 7,
8, 13, 14, 21-34, 58, 95, 96, 123, 156, 160,
189-206, 221-261
Synthetic genomes ........ 287
Systems biology ........ v, 4, 18, 222, 231, 301, 420
Systems metabolic engineering ........ 209-218
T 

Tumor ........ vi, 275-282, 327, 335, 346




U
Unsupervised learning ........ 23, 229, 306,
397, 444, 447
V
Vibrio natriegens ........ vi, 209-218
W
Web-interface ........ 161-169


